From 603c6f33b62295a0cd94c1c51a0410c50a8d047c Mon Sep 17 00:00:00 2001
From: Ayush Dattagupta <ayushdg95@gmail.com>
Date: Wed, 13 May 2026 14:05:31 -0700
Subject: [PATCH 1/2] Bulk updates for the 1.2.0 release

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
---
 docs/nemotron/data/curation/nemotron-cc.md    |   6 +-
 .../data/curation/nemotron-cc/README.md       |  16 +-
 .../nemotron-cc/step_1-download_extract.py    |   2 +-
 .../nemotron-cc/step_2a-exact_dedup.py        |   2 +-
 .../nemotron-cc/step_2b-fuzzy_dedup.py        |   2 +-
 .../step_2c-substring_dedup/README.md         |   2 +-
 .../prepare_dataset.py                        |   3 +-
 .../step_3-quality_classification.py          |   2 +-
 .../data/curation/nemotron-cc/step_4-sdg.py   | 293 +++++++++++++-----
 9 files changed, 238 insertions(+), 90 deletions(-)

diff --git a/docs/nemotron/data/curation/nemotron-cc.md b/docs/nemotron/data/curation/nemotron-cc.md
index 77cd4e024..ac9bba4cd 100644
--- a/docs/nemotron/data/curation/nemotron-cc.md
+++ b/docs/nemotron/data/curation/nemotron-cc.md
@@ -19,7 +19,7 @@ Common Crawl → Extract & Clean → Deduplicate → Quality Classify → Synthe
 | 2b | `step_2b-fuzzy_dedup.py` | MinHash + LSH fuzzy deduplication | GPU (identify), CPU (remove) |
 | 2c | `step_2c-substring_dedup/` | Exact substring deduplication using suffix arrays | CPU-only |
 | 3 | `step_3-quality_classification.py` | Ensemble quality scoring into 20 buckets | GPU (classify), CPU (ensemble) |
-| 4 | `step_4-sdg.py` | LLM-based synthetic data generation on top-quality data | CPU + LLM endpoint |
+| 4 | `step_4-sdg.py` | LLM-based synthetic data generation on top-quality data | GPU (local inference server, default) or CPU + external LLM endpoint (with `--no-serve-model`) |
 
 Steps 1–3 progressively filter and annotate the data. Step 4 generates synthetic training data (diverse QA, distillation, knowledge extraction, knowledge lists) from the highest-quality documents (buckets 18–19).
 
@@ -35,9 +35,9 @@ See the recipe README at `src/nemotron/recipes/data/curation/nemotron-cc/README.
 
 ## Prerequisites
 
-- [NeMo Curator](https://github.com/NVIDIA/NeMo-Curator) installed with Ray support
+- [NeMo Curator](https://github.com/NVIDIA/NeMo-Curator) 1.2.0 (26.04 release) or newer, installed with Ray support
 - GPU(s) for steps 2a, 2b, and 3 (deduplication and classification)
-- Access to an OpenAI-compatible LLM endpoint for step 4 (NVIDIA NIM, vLLM, or cloud API)
+- For step 4, one of: GPU(s) to host a local inference server (default), an OpenAI-compatible endpoint (self-hosted vLLM/NIM or cloud, via `--no-serve-model`), or an [NVIDIA Build](https://build.nvidia.com/) API key
 
 ## After Curation
 
diff --git a/src/nemotron/recipes/data/curation/nemotron-cc/README.md b/src/nemotron/recipes/data/curation/nemotron-cc/README.md
index 212702b34..64a8c49f5 100644
--- a/src/nemotron/recipes/data/curation/nemotron-cc/README.md
+++ b/src/nemotron/recipes/data/curation/nemotron-cc/README.md
@@ -4,10 +4,10 @@ This directory contains the recipe for curating datasets similar to the [Nemotro
 
 ### Requirements
 
-- [NeMo Curator](https://github.com/NVIDIA/NeMo-Curator) > 1.1.0 ([install from main](https://docs.nvidia.com/nemo/curator/latest/admin/installation.html))
+- [NeMo Curator](https://github.com/NVIDIA/NeMo-Curator) 1.2.0 (26.04 release) or newer ([install instructions](https://docs.nvidia.com/nemo/curator/latest/admin/installation.html))
 - GPU(s) for steps 2a, 2b, and 3 (deduplication and classification)
 - [Cargo/Rust](https://doc.rust-lang.org/cargo/getting-started/installation.html) for step 2c (building `deduplicate-text-datasets`)
-- Access to an OpenAI-compatible LLM endpoint for step 4 (NVIDIA NIM, vLLM, or cloud API)
+- For step 4, one of: GPU(s) to host a local inference server (default), an OpenAI-compatible endpoint (self-hosted vLLM/NIM or cloud), or an [NVIDIA Build](https://build.nvidia.com/) API key
 
 ### Pipeline Overview
 
@@ -65,7 +65,7 @@ Ensemble quality scoring and bucketing into 20 quality tiers:
 
 #### Step 4: Synthetic Data Generation (`step_4-sdg.py`)
 
-LLM-based synthetic data generation on the highest-quality documents (buckets 18 and 19). This is a CPU-only pipeline — LLM inference happens via API calls to an external endpoint (NVIDIA Integrate, or a self-hosted OAI endpoint compatible server).
+LLM-based synthetic data generation on the highest-quality documents (buckets 18 and 19).
 
 Four generation tasks:
 
@@ -78,6 +78,12 @@ Four generation tasks:
 
 Each task runs as an independent pipeline (preprocessing, LLM generation, postprocessing, write). When `--task all` is used, the four tasks run sequentially. They can also be run as separate processes in parallel.
 
-- **Default model:** [`Qwen/Qwen3-30B-A3B-Instruct-2507`](https://huggingface.co/Qwen/Qwen3-30B-A3B-Instruct-2507). This model is not available on [NVIDIA Build](https://build.nvidia.com/), so you'll need to provide a `--base-url` pointing to an endpoint serving it (self-hosted via vLLM/NIM, or any OpenAI-compatible cloud provider). Alternatively, you can use any model available on NVIDIA Build by setting `--model-name` and `--tokenizer` accordingly.
+**LLM backends.** The script supports three ways to reach the model — pick one:
+
+1. **Local inference server (default).** The script spins up a Ray Serve + vLLM deployment of `--model-name` on the local GPU cluster. No API key needed; scales out across replicas. Tune with `--tensor-parallel-size`, `--min-replicas`, `--max-replicas`. If GPU utilization is low, increase `--max-concurrent-requests` (try 256–512). Pass `--no-serve-model` to use one of the external options below instead.
+2. **Existing OpenAI-compatible endpoint.** Pass `--no-serve-model --base-url <url>` to use a self-hosted vLLM/TRT-LLM/NIM server (or any OpenAI-compatible cloud provider). `--api-key` is forwarded if needed.
+3. **[NVIDIA Build](https://build.nvidia.com/).** Pass `--no-serve-model` to use the default `--base-url`. Requires `--api-key` (or `NVIDIA_API_KEY` env var). The default `--model-name` (`Qwen/Qwen3-30B-A3B-Instruct-2507`) is not on NVIDIA Build, so set `--model-name` (and `--tokenizer`) to a model that is.
+
+- **Default model:** [`Qwen/Qwen3-30B-A3B-Instruct-2507`](https://huggingface.co/Qwen/Qwen3-30B-A3B-Instruct-2507).
 - **Output:** `data/sdg_output/<task_name>/`.
-- **Resources:** CPU-only for the script itself. Requires access to an LLM endpoint.
+- **Resources:** With `--serve-model`, GPU(s) for vLLM. Otherwise CPU-only; just needs network access to the chosen endpoint.
diff --git a/src/nemotron/recipes/data/curation/nemotron-cc/step_1-download_extract.py b/src/nemotron/recipes/data/curation/nemotron-cc/step_1-download_extract.py
index 348a0e86a..d1ea9ba76 100644
--- a/src/nemotron/recipes/data/curation/nemotron-cc/step_1-download_extract.py
+++ b/src/nemotron/recipes/data/curation/nemotron-cc/step_1-download_extract.py
@@ -29,7 +29,7 @@
 from fsspec.core import url_to_fs
 from loguru import logger
 
-from nemo_curator.backends.experimental.ray_data import RayDataExecutor
+from nemo_curator.backends.ray_data import RayDataExecutor
 from nemo_curator.core.client import RayClient
 from nemo_curator.pipeline import Pipeline
 from nemo_curator.stages.base import ProcessingStage
diff --git a/src/nemotron/recipes/data/curation/nemotron-cc/step_2a-exact_dedup.py b/src/nemotron/recipes/data/curation/nemotron-cc/step_2a-exact_dedup.py
index c754f649c..2656e7b61 100644
--- a/src/nemotron/recipes/data/curation/nemotron-cc/step_2a-exact_dedup.py
+++ b/src/nemotron/recipes/data/curation/nemotron-cc/step_2a-exact_dedup.py
@@ -37,7 +37,7 @@
 
 from nemo_curator.stages.file_partitioning import FilePartitioningStage
 from nemo_curator.tasks import EmptyTask
-from nemo_curator.backends.experimental.ray_data import RayDataExecutor
+from nemo_curator.backends.ray_data import RayDataExecutor
 from nemo_curator.core.client import RayClient
 from nemo_curator.stages.deduplication.exact.workflow import ExactDeduplicationWorkflow
 from nemo_curator.stages.deduplication.id_generator import CURATOR_DEDUP_ID_STR
diff --git a/src/nemotron/recipes/data/curation/nemotron-cc/step_2b-fuzzy_dedup.py b/src/nemotron/recipes/data/curation/nemotron-cc/step_2b-fuzzy_dedup.py
index 82b2ea5f5..d14b04b5b 100644
--- a/src/nemotron/recipes/data/curation/nemotron-cc/step_2b-fuzzy_dedup.py
+++ b/src/nemotron/recipes/data/curation/nemotron-cc/step_2b-fuzzy_dedup.py
@@ -36,7 +36,7 @@
 
 from loguru import logger
 
-from nemo_curator.backends.experimental.ray_data import RayDataExecutor
+from nemo_curator.backends.ray_data import RayDataExecutor
 from nemo_curator.core.client import RayClient
 from nemo_curator.stages.deduplication.fuzzy.workflow import FuzzyDeduplicationWorkflow
 from nemo_curator.stages.deduplication.id_generator import CURATOR_DEDUP_ID_STR
diff --git a/src/nemotron/recipes/data/curation/nemotron-cc/step_2c-substring_dedup/README.md b/src/nemotron/recipes/data/curation/nemotron-cc/step_2c-substring_dedup/README.md
index 82945b43c..7fe68072a 100644
--- a/src/nemotron/recipes/data/curation/nemotron-cc/step_2c-substring_dedup/README.md
+++ b/src/nemotron/recipes/data/curation/nemotron-cc/step_2c-substring_dedup/README.md
@@ -13,7 +13,7 @@ If you are adapting it for Slurm:
 3. The `remove_duplicates` step must run on a single exclusive node.
 
 Dependencies:
-- `nemo_curator>=1.1.0`
+- `nemo_curator>=1.2.0` (26.04 release)
 - `cargo` (for building `deduplicate-text-datasets`). See https://doc.rust-lang.org/cargo/getting-started/installation.html
 
 We recommend splitting up the dataset into 100 GB chunks or less, and executing `exact_substring_dedup.sh` on each 100 GB chunk.
diff --git a/src/nemotron/recipes/data/curation/nemotron-cc/step_2c-substring_dedup/prepare_dataset.py b/src/nemotron/recipes/data/curation/nemotron-cc/step_2c-substring_dedup/prepare_dataset.py
index 24515dcf2..d4c2d0b61 100644
--- a/src/nemotron/recipes/data/curation/nemotron-cc/step_2c-substring_dedup/prepare_dataset.py
+++ b/src/nemotron/recipes/data/curation/nemotron-cc/step_2c-substring_dedup/prepare_dataset.py
@@ -21,7 +21,7 @@
 import tiktoken
 
 from nemo_curator.backends.base import WorkerMetadata
-from nemo_curator.backends.experimental.ray_data import RayDataExecutor
+from nemo_curator.backends.ray_data import RayDataExecutor
 from nemo_curator.core.client import RayClient
 from nemo_curator.pipeline import Pipeline
 from nemo_curator.stages.base import ProcessingStage
@@ -91,6 +91,7 @@ def write_id_to_filename(input_file_paths: list[str], output_path: str) -> dict[
     input_file_list = [os.path.basename(filename) for filename in input_file_paths]
     id_to_filename = {str(i): filename for i, filename in enumerate(input_file_list)}
 
+    os.makedirs(output_path, exist_ok=True)
     with open(f"{output_path}/id_to_filename.json", "w") as fp:
         json.dump(id_to_filename, fp)
 
diff --git a/src/nemotron/recipes/data/curation/nemotron-cc/step_3-quality_classification.py b/src/nemotron/recipes/data/curation/nemotron-cc/step_3-quality_classification.py
index 32e6a2151..6efa1e798 100644
--- a/src/nemotron/recipes/data/curation/nemotron-cc/step_3-quality_classification.py
+++ b/src/nemotron/recipes/data/curation/nemotron-cc/step_3-quality_classification.py
@@ -41,7 +41,7 @@
 import pandas as pd
 from loguru import logger
 
-from nemo_curator.backends.experimental.ray_data import RayDataExecutor
+from nemo_curator.backends.ray_data import RayDataExecutor
 from nemo_curator.core.client import RayClient
 from nemo_curator.pipeline import Pipeline
 from nemo_curator.stages.base import ProcessingStage
diff --git a/src/nemotron/recipes/data/curation/nemotron-cc/step_4-sdg.py b/src/nemotron/recipes/data/curation/nemotron-cc/step_4-sdg.py
index 85eea7e42..939fc46ac 100644
--- a/src/nemotron/recipes/data/curation/nemotron-cc/step_4-sdg.py
+++ b/src/nemotron/recipes/data/curation/nemotron-cc/step_4-sdg.py
@@ -30,25 +30,45 @@
 to an LLM for generation, postprocesses the results, and writes output to
 parquet or JSONL.
 
-This is a CPU-only pipeline — LLM inference happens via API calls to an
-external endpoint (NVIDIA Integrate, or a self-hosted vLLM/TRT-LLM server).
+LLM backends (pick one):
+
+  1. Local inference server (default).
+     The script spins up a Ray Serve + vLLM deployment of --model-name on
+     the local cluster and routes generation through it. No API key
+     required, and it scales out across replicas. Pass --no-serve-model
+     to opt out and use one of the external options below.
+
+  2. Existing OpenAI-compatible endpoint.
+     Pass --no-serve-model --base-url <url> for a self-hosted vLLM/TRT-LLM/
+     NIM server (or any OpenAI-compatible cloud provider). --api-key is
+     forwarded if set.
+
+  3. NVIDIA Build (build.nvidia.com).
+     Pass --no-serve-model to use the default --base-url. Requires
+     --api-key (or NVIDIA_API_KEY env var). Note: the default --model-name
+     (Qwen3-30B-A3B-Instruct-2507) is not hosted on NVIDIA Build — pass
+     --model-name to a model that is.
 
 Usage:
-    # Run all SDG tasks on bucket 18/19 data
+    # Default: stand up a local inference server (4 GPUs, 4 replicas).
+    # Bump --max-concurrent-requests if GPU utilization is low.
     python step_4-sdg.py \\
         --task all \\
-        --tokenizer Qwen/Qwen3-30B-A3B-Instruct-2507
+        --tensor-parallel-size 1
 
-    # Run a single task
+    # Hit an existing self-hosted OpenAI-compatible endpoint
     python step_4-sdg.py \\
-        --task diverse_qa \\
+        --task all \\
+        --no-serve-model \\
+        --base-url http://localhost:8000/v1 \\
         --tokenizer Qwen/Qwen3-30B-A3B-Instruct-2507
 
-    # Use a local/self-hosted endpoint
+    # Hit NVIDIA Build (set NVIDIA_API_KEY in env or pass --api-key)
     python step_4-sdg.py \\
-        --task distill \\
-        --base-url http://localhost:8000/v1 \\
-        --tokenizer Qwen/Qwen3-30B-A3B-Instruct-2507
+        --task all \\
+        --no-serve-model \\
+        --model-name meta/llama-3.3-70b-instruct \\
+        --tokenizer meta-llama/Llama-3.3-70B-Instruct
 
 See README.md in this directory for detailed usage instructions.
 """
@@ -61,7 +81,7 @@
 from loguru import logger
 from transformers import AutoTokenizer
 
-from nemo_curator.backends.experimental.ray_data import RayDataExecutor
+from nemo_curator.backends.ray_data import RayDataExecutor
 from nemo_curator.backends.xenna import XennaExecutor
 from nemo_curator.core.client import RayClient
 from nemo_curator.models.client.llm_client import GenerationConfig
@@ -524,79 +544,140 @@ def run_task(
     return metrics
 
 
-def main(args: argparse.Namespace) -> None:
-    ray_client = RayClient(num_cpus=args.num_cpus, include_dashboard=False)
-    ray_client.start()
-
-    tokenizer_name = args.tokenizer if args.tokenizer else args.model_name
-    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
-    args.hf_token = os.environ.get("HF_TOKEN", "")
-
-    api_key = args.api_key
-    if not api_key:
-        msg = (
-            "API key is required. Set NVIDIA_API_KEY environment variable or use --api-key. "
-            "Get your API key from https://build.nvidia.com/settings/api-keys"
-        )
-        raise ValueError(msg)
+def _start_inference_server(args: argparse.Namespace):
+    """Start a local Ray Serve inference server for the model.
 
-    llm_client = AsyncOpenAIClient(
-        api_key=api_key,
-        base_url=args.base_url,
-        max_concurrent_requests=args.max_concurrent_requests,
-        max_retries=args.max_retries,
-        base_delay=args.base_delay,
-        timeout=args.timeout,
+    Returns the InferenceServer instance.
+    """
+    from nemo_curator.backends.utils import get_available_cpu_gpu_resources
+    from nemo_curator.core.serve import InferenceModelConfig, InferenceServer
+
+    _, num_gpus = get_available_cpu_gpu_resources()
+    num_gpus = int(num_gpus)
+
+    tp_size = args.tensor_parallel_size if args.tensor_parallel_size is not None else num_gpus
+    default_replicas = max(num_gpus // tp_size, 1)
+    min_replicas = args.min_replicas if args.min_replicas is not None else default_replicas
+    max_replicas = args.max_replicas if args.max_replicas is not None else default_replicas
+    logger.info(
+        f"Starting local inference server with tensor_parallel_size={tp_size}, "
+        f"min_replicas={min_replicas}, max_replicas={max_replicas}"
     )
 
-    generation_config = GenerationConfig(
-        temperature=args.temperature if args.temperature is not None else GENERATION_DEFAULTS["temperature"],
-        top_p=args.top_p if args.top_p is not None else GENERATION_DEFAULTS["top_p"],
-        top_k=args.top_k if args.top_k is not None else GENERATION_DEFAULTS["top_k"],
-        max_tokens=args.max_tokens if args.max_tokens is not None else GENERATION_DEFAULTS["max_output_tokens"],
-        stop=args.end_strings if args.end_strings is not None else GENERATION_DEFAULTS["end_strings"],
-        seed=args.seed,
+    server_config = InferenceModelConfig(
+        model_identifier=args.model_name,
+        deployment_config={
+            "autoscaling_config": {
+                "min_replicas": min_replicas,
+                "max_replicas": max_replicas,
+            },
+        },
+        engine_kwargs={
+            "tensor_parallel_size": tp_size,
+            "max_model_len": args.max_model_len,
+            "gpu_memory_utilization": args.gpu_memory_utilization,
+        },
     )
 
-    if args.task == "all":
-        tasks = list(TASK_CONFIGS.keys())
-    else:
-        tasks = [args.task]
-
-    os.makedirs(args.output_dir, exist_ok=True)
-
-    logger.info("Nemotron-CC SDG Pipeline")
-    logger.info(f"  Input:     {args.input_dir}")
-    logger.info(f"  Output:    {args.output_dir}")
-    logger.info(f"  Buckets:   {HIGH_QUALITY_BUCKETS}")
-    logger.info(f"  Tasks:     {tasks}")
-    logger.info(f"  Model:     {args.model_name}")
-    logger.info(f"  Tokenizer: {tokenizer_name}")
-
-    all_metrics = {}
-    total_start = time.perf_counter()
-
-    for task_name in tasks:
-        task_metrics = run_task(
-            task_name=task_name,
-            args=args,
-            llm_client=llm_client,
-            generation_config=generation_config,
-            tokenizer=tokenizer,
+    server = InferenceServer(models=[server_config])
+    server.start()
+    logger.info(f"Local inference server ready at {server.endpoint}")
+    return server
+
+
+def main(args: argparse.Namespace) -> None:
+    inference_server = None
+
+    try:
+        tokenizer_name = args.tokenizer if args.tokenizer else args.model_name
+        tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
+        args.hf_token = os.environ.get("HF_TOKEN", "")
+
+        # Start the Ray cluster FIRST — InferenceServer requires an existing
+        # cluster to deploy Serve actors onto.
+        ray_client = RayClient(num_cpus=args.num_cpus, include_dashboard=False)
+        ray_client.start()
+
+        if args.serve_model:
+            import ray
+            if not ray.is_initialized():
+                ray.init(address="auto", ignore_reinit_error=True)
+            inference_server = _start_inference_server(args)
+            base_url = inference_server.endpoint
+            api_key = "unused"
+        else:
+            base_url = args.base_url
+            api_key = args.api_key
+            if not api_key:
+                msg = (
+                    "API key is required. Set NVIDIA_API_KEY environment variable or use --api-key. "
+                    "Get your API key from https://build.nvidia.com/settings/api-keys"
+                )
+                raise ValueError(msg)
+
+        llm_client = AsyncOpenAIClient(
+            api_key=api_key,
+            base_url=base_url,
+            max_concurrent_requests=args.max_concurrent_requests,
+            max_retries=args.max_retries,
+            base_delay=args.base_delay,
+            timeout=args.timeout,
         )
-        all_metrics[task_name] = task_metrics
-        _save_metrics(
-            task_metrics,
-            os.path.join(args.output_dir, f"{task_name}_metrics.json"),
+
+        generation_config = GenerationConfig(
+            temperature=args.temperature if args.temperature is not None else GENERATION_DEFAULTS["temperature"],
+            top_p=args.top_p if args.top_p is not None else GENERATION_DEFAULTS["top_p"],
+            top_k=args.top_k if args.top_k is not None else GENERATION_DEFAULTS["top_k"],
+            max_tokens=args.max_tokens if args.max_tokens is not None else GENERATION_DEFAULTS["max_output_tokens"],
+            stop=args.end_strings if args.end_strings is not None else GENERATION_DEFAULTS["end_strings"],
+            seed=args.seed,
         )
 
-    total_elapsed = time.perf_counter() - total_start
-    all_metrics["total_elapsed_s"] = round(total_elapsed, 2)
-    _save_metrics(all_metrics, os.path.join(args.output_dir, "sdg_metrics.json"))
+        if args.task == "all":
+            tasks = list(TASK_CONFIGS.keys())
+        else:
+            tasks = [args.task]
+
+        os.makedirs(args.output_dir, exist_ok=True)
+
+        logger.info("Nemotron-CC SDG Pipeline")
+        logger.info(f"  Input:     {args.input_dir}")
+        logger.info(f"  Output:    {args.output_dir}")
+        logger.info(f"  Buckets:   {HIGH_QUALITY_BUCKETS}")
+        logger.info(f"  Tasks:     {tasks}")
+        logger.info(f"  Model:     {args.model_name}")
+        logger.info(f"  Tokenizer: {tokenizer_name}")
+        logger.info(f"  Endpoint:  {base_url}")
+        if args.serve_model:
+            logger.info("  Mode:      local inference server")
+
+        all_metrics = {}
+        total_start = time.perf_counter()
+
+        for task_name in tasks:
+            task_metrics = run_task(
+                task_name=task_name,
+                args=args,
+                llm_client=llm_client,
+                generation_config=generation_config,
+                tokenizer=tokenizer,
+            )
+            all_metrics[task_name] = task_metrics
+            _save_metrics(
+                task_metrics,
+                os.path.join(args.output_dir, f"{task_name}_metrics.json"),
+            )
+
+        total_elapsed = time.perf_counter() - total_start
+        all_metrics["total_elapsed_s"] = round(total_elapsed, 2)
+        _save_metrics(all_metrics, os.path.join(args.output_dir, "sdg_metrics.json"))
 
-    logger.info(f"All SDG tasks completed in {total_elapsed:.1f}s ({total_elapsed / 60:.1f}m)")
+        logger.info(f"All SDG tasks completed in {total_elapsed:.1f}s ({total_elapsed / 60:.1f}m)")
 
-    ray_client.stop()
+    finally:
+        if inference_server is not None:
+            inference_server.stop()
+        ray_client.stop()
 
 
 def attach_args() -> argparse.ArgumentParser:
@@ -651,6 +732,62 @@ def attach_args() -> argparse.ArgumentParser:
         help="Number of CPUs for the local Ray cluster (default: all available).",
     )
 
+    parser.add_argument(
+        "--serve-model",
+        action=argparse.BooleanOptionalAction,
+        default=True,
+        help=(
+            "Start a local Ray Serve inference server for the model instead of "
+            "calling an external API (default: True). Requires GPUs and the model "
+            "weights to be accessible (downloaded automatically from HuggingFace). "
+            "When enabled, --base-url and --api-key are ignored. Pass "
+            "--no-serve-model to disable and use --base-url instead."
+        ),
+    )
+    parser.add_argument(
+        "--tensor-parallel-size",
+        type=int,
+        default=None,
+        help=(
+            "Number of GPUs for tensor parallelism when using --serve-model. "
+            "Defaults to the number of available GPUs."
+        ),
+    )
+    parser.add_argument(
+        "--min-replicas",
+        type=int,
+        default=None,
+        help=(
+            "Minimum number of model replicas when using --serve-model. "
+            "Defaults to num_gpus // tensor_parallel_size."
+        ),
+    )
+    parser.add_argument(
+        "--max-replicas",
+        type=int,
+        default=None,
+        help=(
+            "Maximum number of model replicas when using --serve-model. "
+            "Defaults to num_gpus // tensor_parallel_size."
+        ),
+    )
+    parser.add_argument(
+        "--max-model-len",
+        type=int,
+        default=8192,
+        help=(
+            "Maximum sequence length for the vLLM engine when using --serve-model. "
+            "Reduces KV cache memory. The pipeline needs at most ~4K tokens, so the "
+            "default of 8192 is sufficient. Set higher only if needed."
+        ),
+    )
+    parser.add_argument(
+        "--gpu-memory-utilization",
+        type=float,
+        default=0.9,
+        help="Fraction of GPU memory for vLLM when using --serve-model (0.0-1.0).",
+    )
+
     parser.add_argument(
         "--api-key",
         type=str,
@@ -666,8 +803,12 @@ def attach_args() -> argparse.ArgumentParser:
     parser.add_argument(
         "--max-concurrent-requests",
         type=int,
-        default=3,
-        help="Maximum number of concurrent API requests.",
+        default=32,
+        help=(
+            "Maximum number of concurrent API requests. Increase if GPU "
+            "utilization on the inference server is low (e.g. 256-512 for "
+            "a local --serve-model deployment with multiple replicas)."
+        ),
     )
     parser.add_argument(
         "--max-retries",

From e1b4f11b63e97ee0d101fd50a09cf3ce6be42480 Mon Sep 17 00:00:00 2001
From: Ayush Dattagupta <ayushdg95@gmail.com>
Date: Mon, 18 May 2026 15:47:30 +0000
Subject: [PATCH 2/2] Update readme instructions

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
---
 .../data/curation/nemotron-cc/README.md       | 102 +++++++++++-------
 .../data/curation/nemotron-cc/step_4-sdg.py   |   8 ++
 2 files changed, 71 insertions(+), 39 deletions(-)

diff --git a/src/nemotron/recipes/data/curation/nemotron-cc/README.md b/src/nemotron/recipes/data/curation/nemotron-cc/README.md
index 64a8c49f5..ec518bd1a 100644
--- a/src/nemotron/recipes/data/curation/nemotron-cc/README.md
+++ b/src/nemotron/recipes/data/curation/nemotron-cc/README.md
@@ -11,63 +11,82 @@ This directory contains the recipe for curating datasets similar to the [Nemotro
 
 ### Pipeline Overview
 
-#### Step 1: Download, Extract, and Clean (`step_1-download_extract.py`)
+| # | Script | Compute | Output |
+|---|--------|---------|--------|
+| 1 | [`step_1-download_extract.py`](#step-1-download-extract-and-clean) | CPU | `data/cleaned_extracted/` |
+| 2a | [`step_2a-exact_dedup.py`](#step-2a-exact-deduplication) | GPU + CPU | `data/exact_deduplicated/` |
+| 2b | [`step_2b-fuzzy_dedup.py`](#step-2b-fuzzy-deduplication) | GPU + CPU | `data/fuzzy_deduplicated/` |
+| 2c | [`step_2c-substring_dedup/`](#step-2c-substring-deduplication) | CPU | `data/substring_deduped/` |
+| 3 | [`step_3-quality_classification.py`](#step-3-quality-classification) | GPU + CPU | `data/quality_labeling/bucketed_results/` |
+| 4 | [`step_4-sdg.py`](#step-4-synthetic-data-generation) | GPU or external API | `data/sdg_output/` |
+
+---
+
+#### Step 1: Download, Extract, and Clean
 
 A CPU-only pipeline that produces clean text from raw web data:
 
 - Downloads Common Crawl snapshots (WARC files) and extracts text using JusText.
 - Annotates each document with a language using a FastText language identification model.
 - Fixes mojibake (encoding issues) via Unicode reformatting.
-- **Output:** `data/cleaned_extracted/`
-- **Resources:** CPU-only. We recommend each worker has at least 2GB of RAM to prevent OOM errors.
 
-#### Step 2a: Exact Deduplication (`step_2a-exact_dedup.py`)
+**Resources:** CPU-only. Recommend at least 2GB RAM per worker to prevent OOM.
+
+---
+
+#### Step 2a: Exact Deduplication
+
+Exact deduplication via document hashing. Run `--identify` then `--remove`.
+
+| Phase | Compute | Scale tested / notes |
+|-------|---------|----------------------|
+| `--identify` | GPU | 8× H100 for a single snapshot (~4-10TB). ~128× 80GB GPUs recommended for full Common Crawl. |
+| `--remove` | CPU, ≥6GB RAM/worker | Reads cached duplicate IDs and filters the original dataset. |
 
-Exact deduplication using document hashing:
+---
 
-- **Phase 1 (`--identify`):** Hashes every document and identifies exact duplicates.
-  - **Resources:** Requires GPU(s) for accelerated hashing. For a single snapshot (~4-10TB) extracted we tested with 8 H100 GPUs. For all of Common Crawl we recommend ~128 GPUs with 80GB VRAM per GPU.
-- **Phase 2 (`--remove`):** Removes duplicate documents, keeping one copy.
-  - **Resources:** CPU-only. Reads duplicate IDs from the cache directory and filters the original dataset. We recommend each worker has at-least 6GB of RAM to prevent OOM errors.
-- **Output:** `data/exact_deduplicated/`.
+#### Step 2b: Fuzzy Deduplication
 
-#### Step 2b: Fuzzy Deduplication (`step_2b-fuzzy_dedup.py`)
+Fuzzy deduplication using MinHash + LSH. Run `--identify` then `--remove`.
 
-Fuzzy deduplication using MinHash + LSH:
+| Phase | Compute | Scale tested / notes |
+|-------|---------|----------------------|
+| `--identify` | GPU | 8× H100 for a single snapshot (~1-8TB exact-deduped). |
+| `--remove` | CPU, ≥6GB RAM/worker | Filters using connected-components results. |
 
-- **Phase 1 (`--identify`):** Identify near duplicate docs using MinHash-LSH based duplicate identification.
-  - **Resources:** Requires GPU(s). For a single snapshot (~1-8TB) exact deduplicated we tested with 8 H100 GPUs.
-- **Phase 2 (`--remove`):** Removes fuzzy duplicates based on connected components.
-  - **Resources:** CPU-only. Reads duplicate IDs and filters the original dataset. We recommend each worker has at-least 6GB of RAM to prevent OOM errors.
-- **Output:** `data/fuzzy_deduplicated/`.
+---
 
-#### Step 2c: Substring Deduplication (`step_2c-substring_dedup/`)
+#### Step 2c: Substring Deduplication
 
 CPU-only exact substring deduplication using [Google Research's deduplicate-text-datasets](https://github.com/google-research/deduplicate-text-datasets) ([paper](https://arxiv.org/abs/2107.06499)). Removes duplicate substrings within and across documents using suffix arrays.
 
-- **Resources:** CPU-only. Requires 2-3x the input dataset size in RAM and 10-15x in disk space. We recommend splitting data into 100GB chunks.
-- **Output:** `data/substring_deduped/`.
+**Resources:** CPU-only. Requires 2-3× the input dataset size in RAM and 10-15× on disk. Recommend splitting data into 100GB chunks.
 
 See the [step_2c README](./step_2c-substring_dedup/README.md) for detailed instructions and debugging tips.
 
-#### Step 3: Quality Classification (`step_3-quality_classification.py`)
+---
 
-Ensemble quality scoring and bucketing into 20 quality tiers:
+#### Step 3: Quality Classification
 
-- **Phase 1 (`--classify`):** Filters to English, then runs three quality classifiers in parallel:
-  - [FineWebNemotronEduClassifier](https://huggingface.co/nvidia/nemocurator-fineweb-nemotron-4-edu-classifier)
-  - [FineWebMixtralEduClassifier](https://huggingface.co/nvidia/nemocurator-fineweb-mixtral-edu-classifier)
-  - [FastText quality filter (`fasttext-oh-eli5`)](https://huggingface.co/mlfoundations/fasttext-oh-eli5)
-  - **Resources:** Requires GPU(s) for the neural classifiers. For a single snapshot we tested with 64 H100 GPUs. This scale is embarrassingly parallel so use fewer/more GPUs as needed with at least 80GB VRAM per GPU.
-- **Phase 2 (`--ensemble`):** Computes token-weighted percentile thresholds from sampled classification scores, maps float scores to integer bins (0-19), takes the per-document max across classifiers as the ensemble score.
-  - **Resources:** CPU-only. Reads classification results and computes thresholds and bucketing. Tested with max `fraction=0.1` on a machine with 200GB ram. For OOM errors would recommend reducing the sampling fraction.
-- **Output:** `data/quality_labeling/bucketed_results/ensemble-max-int={0-19}/` partitioned by quality bucket (0 = lowest, 19 = highest).
+Ensemble quality scoring and bucketing into 20 quality tiers. Run `--classify` then `--ensemble`.
 
-#### Step 4: Synthetic Data Generation (`step_4-sdg.py`)
+| Phase | Compute | Scale tested / notes |
+|-------|---------|----------------------|
+| `--classify` | GPU, ≥80GB VRAM | Filters to English, runs three classifiers in parallel. Tested at 64× H100 per snapshot; embarrassingly parallel — scale up/down freely. |
+| `--ensemble` | CPU | Computes token-weighted percentile thresholds and per-document max across classifiers. Tested at `fraction=0.1` on 200GB RAM; reduce sampling fraction if OOM. |
 
-LLM-based synthetic data generation on the highest-quality documents (buckets 18 and 19).
+Classifiers used:
+- [FineWebNemotronEduClassifier](https://huggingface.co/nvidia/nemocurator-fineweb-nemotron-4-edu-classifier)
+- [FineWebMixtralEduClassifier](https://huggingface.co/nvidia/nemocurator-fineweb-mixtral-edu-classifier)
+- [FastText quality filter (`fasttext-oh-eli5`)](https://huggingface.co/mlfoundations/fasttext-oh-eli5)
 
-Four generation tasks:
+**Output layout:** `data/quality_labeling/bucketed_results/ensemble-max-int={0-19}/` partitioned by bucket (0 = lowest, 19 = highest).
+
+---
+
+#### Step 4: Synthetic Data Generation
+
+LLM-based synthetic data generation on the highest-quality documents (buckets 18 and 19). Four generation tasks:
 
 | Task | Description | Max Input / Output Tokens |
 |------|-------------|---------------------------|
@@ -76,14 +95,19 @@ Four generation tasks:
 | `extract_knowledge` | Rewrites text as textbook/Wikipedia-style passages focused on factual content | 1400 / 1400 |
 | `knowledge_list` | Extracts organized bulleted lists of key facts, concepts, and statistics | 1000 / 600 |
 
-Each task runs as an independent pipeline (preprocessing, LLM generation, postprocessing, write). When `--task all` is used, the four tasks run sequentially. They can also be run as separate processes in parallel.
+Each task is an independent pipeline (preprocess → LLM generate → postprocess → write). `--task all` runs the four sequentially; they can also be launched as parallel processes.
+
+**LLM backends** — pick one:
+
+| Backend | How to select | Notes |
+|---------|---------------|-------|
+| Local inference server (default) | (default) | Spins up a Ray Serve + vLLM deployment of `--model-name` on the local GPU cluster. No API key. Tune with `--tensor-parallel-size`, `--min-replicas`, `--max-replicas`; bump `--max-concurrent-requests` (try 256–512) if GPU utilization is low. |
+| Existing OpenAI-compatible endpoint | `--no-serve-model --base-url <url>` | Self-hosted vLLM/TRT-LLM/NIM or any OpenAI-compatible cloud provider. `--api-key` forwarded if set. |
+| [NVIDIA Build](https://build.nvidia.com/) | `--no-serve-model` | Uses the default `--base-url`. Requires `--api-key` (or `NVIDIA_API_KEY`). Default `--model-name` is not on NVIDIA Build — set `--model-name` (and `--tokenizer`) to a model that is. |
 
-**LLM backends.** The script supports three ways to reach the model — pick one:
+> **Note on `--tokenizer`:** The tokenizer is loaded via Hugging Face `AutoTokenizer`, so `--tokenizer` must be a Hugging Face repo id (or local path to HF tokenizer files), regardless of which backend you pick. If `--tokenizer` is not set, it defaults to `--model-name`, which in some cases is not a valid HF tokenizer path — for example `--model-name meta/llama-3.3-70b-instruct` needs `--tokenizer meta-llama/Llama-3.3-70B-Instruct` set explicitly.
 
-1. **Local inference server (default).** The script spins up a Ray Serve + vLLM deployment of `--model-name` on the local GPU cluster. No API key needed; scales out across replicas. Tune with `--tensor-parallel-size`, `--min-replicas`, `--max-replicas`. If GPU utilization is low, increase `--max-concurrent-requests` (try 256–512). Pass `--no-serve-model` to use one of the external options below instead.
-2. **Existing OpenAI-compatible endpoint.** Pass `--no-serve-model --base-url <url>` to use a self-hosted vLLM/TRT-LLM/NIM server (or any OpenAI-compatible cloud provider). `--api-key` is forwarded if needed.
-3. **[NVIDIA Build](https://build.nvidia.com/).** Pass `--no-serve-model` to use the default `--base-url`. Requires `--api-key` (or `NVIDIA_API_KEY` env var). The default `--model-name` (`Qwen/Qwen3-30B-A3B-Instruct-2507`) is not on NVIDIA Build, so set `--model-name` (and `--tokenizer`) to a model that is.
+**Defaults:**
 
 - **Default model:** [`Qwen/Qwen3-30B-A3B-Instruct-2507`](https://huggingface.co/Qwen/Qwen3-30B-A3B-Instruct-2507).
-- **Output:** `data/sdg_output/<task_name>/`.
 - **Resources:** With `--serve-model`, GPU(s) for vLLM. Otherwise CPU-only; just needs network access to the chosen endpoint.
diff --git a/src/nemotron/recipes/data/curation/nemotron-cc/step_4-sdg.py b/src/nemotron/recipes/data/curation/nemotron-cc/step_4-sdg.py
index 939fc46ac..0aac83eaa 100644
--- a/src/nemotron/recipes/data/curation/nemotron-cc/step_4-sdg.py
+++ b/src/nemotron/recipes/data/curation/nemotron-cc/step_4-sdg.py
@@ -49,6 +49,14 @@
      (Qwen3-30B-A3B-Instruct-2507) is not hosted on NVIDIA Build — pass
      --model-name to a model that is.
 
+Note on --tokenizer:
+  The tokenizer is loaded via Hugging Face AutoTokenizer, so --tokenizer
+  must be a Hugging Face repo id (or local path to HF tokenizer files).
+  If --tokenizer is not set, it defaults to --model-name, which in some
+  cases is not a valid HF tokenizer path — e.g. --model-name
+  meta/llama-3.3-70b-instruct needs --tokenizer
+  meta-llama/Llama-3.3-70B-Instruct set explicitly.
+
 Usage:
     # Default: stand up a local inference server (4 GPUs, 4 replicas).
     # Bump --max-concurrent-requests if GPU utilization is low.