From 603c6f33b62295a0cd94c1c51a0410c50a8d047c Mon Sep 17 00:00:00 2001 From: Ayush Dattagupta Date: Wed, 13 May 2026 14:05:31 -0700 Subject: [PATCH 1/2] Bulk updates for the 1.2.0 release Signed-off-by: Ayush Dattagupta --- docs/nemotron/data/curation/nemotron-cc.md | 6 +- .../data/curation/nemotron-cc/README.md | 16 +- .../nemotron-cc/step_1-download_extract.py | 2 +- .../nemotron-cc/step_2a-exact_dedup.py | 2 +- .../nemotron-cc/step_2b-fuzzy_dedup.py | 2 +- .../step_2c-substring_dedup/README.md | 2 +- .../prepare_dataset.py | 3 +- .../step_3-quality_classification.py | 2 +- .../data/curation/nemotron-cc/step_4-sdg.py | 293 +++++++++++++----- 9 files changed, 238 insertions(+), 90 deletions(-) diff --git a/docs/nemotron/data/curation/nemotron-cc.md b/docs/nemotron/data/curation/nemotron-cc.md index 77cd4e024..ac9bba4cd 100644 --- a/docs/nemotron/data/curation/nemotron-cc.md +++ b/docs/nemotron/data/curation/nemotron-cc.md @@ -19,7 +19,7 @@ Common Crawl → Extract & Clean → Deduplicate → Quality Classify → Synthe | 2b | `step_2b-fuzzy_dedup.py` | MinHash + LSH fuzzy deduplication | GPU (identify), CPU (remove) | | 2c | `step_2c-substring_dedup/` | Exact substring deduplication using suffix arrays | CPU-only | | 3 | `step_3-quality_classification.py` | Ensemble quality scoring into 20 buckets | GPU (classify), CPU (ensemble) | -| 4 | `step_4-sdg.py` | LLM-based synthetic data generation on top-quality data | CPU + LLM endpoint | +| 4 | `step_4-sdg.py` | LLM-based synthetic data generation on top-quality data | GPU (local inference server, default) or CPU + external LLM endpoint (with `--no-serve-model`) | Steps 1–3 progressively filter and annotate the data. Step 4 generates synthetic training data (diverse QA, distillation, knowledge extraction, knowledge lists) from the highest-quality documents (buckets 18–19). @@ -35,9 +35,9 @@ See the recipe README at `src/nemotron/recipes/data/curation/nemotron-cc/README. ## Prerequisites -- [NeMo Curator](https://github.com/NVIDIA/NeMo-Curator) installed with Ray support +- [NeMo Curator](https://github.com/NVIDIA/NeMo-Curator) 1.2.0 (26.04 release) or newer, installed with Ray support - GPU(s) for steps 2a, 2b, and 3 (deduplication and classification) -- Access to an OpenAI-compatible LLM endpoint for step 4 (NVIDIA NIM, vLLM, or cloud API) +- For step 4, one of: GPU(s) to host a local inference server (default), an OpenAI-compatible endpoint (self-hosted vLLM/NIM or cloud, via `--no-serve-model`), or an [NVIDIA Build](https://build.nvidia.com/) API key ## After Curation diff --git a/src/nemotron/recipes/data/curation/nemotron-cc/README.md b/src/nemotron/recipes/data/curation/nemotron-cc/README.md index 212702b34..64a8c49f5 100644 --- a/src/nemotron/recipes/data/curation/nemotron-cc/README.md +++ b/src/nemotron/recipes/data/curation/nemotron-cc/README.md @@ -4,10 +4,10 @@ This directory contains the recipe for curating datasets similar to the [Nemotro ### Requirements -- [NeMo Curator](https://github.com/NVIDIA/NeMo-Curator) > 1.1.0 ([install from main](https://docs.nvidia.com/nemo/curator/latest/admin/installation.html)) +- [NeMo Curator](https://github.com/NVIDIA/NeMo-Curator) 1.2.0 (26.04 release) or newer ([install instructions](https://docs.nvidia.com/nemo/curator/latest/admin/installation.html)) - GPU(s) for steps 2a, 2b, and 3 (deduplication and classification) - [Cargo/Rust](https://doc.rust-lang.org/cargo/getting-started/installation.html) for step 2c (building `deduplicate-text-datasets`) -- Access to an OpenAI-compatible LLM endpoint for step 4 (NVIDIA NIM, vLLM, or cloud API) +- For step 4, one of: GPU(s) to host a local inference server (default), an OpenAI-compatible endpoint (self-hosted vLLM/NIM or cloud), or an [NVIDIA Build](https://build.nvidia.com/) API key ### Pipeline Overview @@ -65,7 +65,7 @@ Ensemble quality scoring and bucketing into 20 quality tiers: #### Step 4: Synthetic Data Generation (`step_4-sdg.py`) -LLM-based synthetic data generation on the highest-quality documents (buckets 18 and 19). This is a CPU-only pipeline — LLM inference happens via API calls to an external endpoint (NVIDIA Integrate, or a self-hosted OAI endpoint compatible server). +LLM-based synthetic data generation on the highest-quality documents (buckets 18 and 19). Four generation tasks: @@ -78,6 +78,12 @@ Four generation tasks: Each task runs as an independent pipeline (preprocessing, LLM generation, postprocessing, write). When `--task all` is used, the four tasks run sequentially. They can also be run as separate processes in parallel. -- **Default model:** [`Qwen/Qwen3-30B-A3B-Instruct-2507`](https://huggingface.co/Qwen/Qwen3-30B-A3B-Instruct-2507). This model is not available on [NVIDIA Build](https://build.nvidia.com/), so you'll need to provide a `--base-url` pointing to an endpoint serving it (self-hosted via vLLM/NIM, or any OpenAI-compatible cloud provider). Alternatively, you can use any model available on NVIDIA Build by setting `--model-name` and `--tokenizer` accordingly. +**LLM backends.** The script supports three ways to reach the model — pick one: + +1. **Local inference server (default).** The script spins up a Ray Serve + vLLM deployment of `--model-name` on the local GPU cluster. No API key needed; scales out across replicas. Tune with `--tensor-parallel-size`, `--min-replicas`, `--max-replicas`. If GPU utilization is low, increase `--max-concurrent-requests` (try 256–512). Pass `--no-serve-model` to use one of the external options below instead. +2. **Existing OpenAI-compatible endpoint.** Pass `--no-serve-model --base-url ` to use a self-hosted vLLM/TRT-LLM/NIM server (or any OpenAI-compatible cloud provider). `--api-key` is forwarded if needed. +3. **[NVIDIA Build](https://build.nvidia.com/).** Pass `--no-serve-model` to use the default `--base-url`. Requires `--api-key` (or `NVIDIA_API_KEY` env var). The default `--model-name` (`Qwen/Qwen3-30B-A3B-Instruct-2507`) is not on NVIDIA Build, so set `--model-name` (and `--tokenizer`) to a model that is. + +- **Default model:** [`Qwen/Qwen3-30B-A3B-Instruct-2507`](https://huggingface.co/Qwen/Qwen3-30B-A3B-Instruct-2507). - **Output:** `data/sdg_output//`. -- **Resources:** CPU-only for the script itself. Requires access to an LLM endpoint. +- **Resources:** With `--serve-model`, GPU(s) for vLLM. Otherwise CPU-only; just needs network access to the chosen endpoint. diff --git a/src/nemotron/recipes/data/curation/nemotron-cc/step_1-download_extract.py b/src/nemotron/recipes/data/curation/nemotron-cc/step_1-download_extract.py index 348a0e86a..d1ea9ba76 100644 --- a/src/nemotron/recipes/data/curation/nemotron-cc/step_1-download_extract.py +++ b/src/nemotron/recipes/data/curation/nemotron-cc/step_1-download_extract.py @@ -29,7 +29,7 @@ from fsspec.core import url_to_fs from loguru import logger -from nemo_curator.backends.experimental.ray_data import RayDataExecutor +from nemo_curator.backends.ray_data import RayDataExecutor from nemo_curator.core.client import RayClient from nemo_curator.pipeline import Pipeline from nemo_curator.stages.base import ProcessingStage diff --git a/src/nemotron/recipes/data/curation/nemotron-cc/step_2a-exact_dedup.py b/src/nemotron/recipes/data/curation/nemotron-cc/step_2a-exact_dedup.py index c754f649c..2656e7b61 100644 --- a/src/nemotron/recipes/data/curation/nemotron-cc/step_2a-exact_dedup.py +++ b/src/nemotron/recipes/data/curation/nemotron-cc/step_2a-exact_dedup.py @@ -37,7 +37,7 @@ from nemo_curator.stages.file_partitioning import FilePartitioningStage from nemo_curator.tasks import EmptyTask -from nemo_curator.backends.experimental.ray_data import RayDataExecutor +from nemo_curator.backends.ray_data import RayDataExecutor from nemo_curator.core.client import RayClient from nemo_curator.stages.deduplication.exact.workflow import ExactDeduplicationWorkflow from nemo_curator.stages.deduplication.id_generator import CURATOR_DEDUP_ID_STR diff --git a/src/nemotron/recipes/data/curation/nemotron-cc/step_2b-fuzzy_dedup.py b/src/nemotron/recipes/data/curation/nemotron-cc/step_2b-fuzzy_dedup.py index 82b2ea5f5..d14b04b5b 100644 --- a/src/nemotron/recipes/data/curation/nemotron-cc/step_2b-fuzzy_dedup.py +++ b/src/nemotron/recipes/data/curation/nemotron-cc/step_2b-fuzzy_dedup.py @@ -36,7 +36,7 @@ from loguru import logger -from nemo_curator.backends.experimental.ray_data import RayDataExecutor +from nemo_curator.backends.ray_data import RayDataExecutor from nemo_curator.core.client import RayClient from nemo_curator.stages.deduplication.fuzzy.workflow import FuzzyDeduplicationWorkflow from nemo_curator.stages.deduplication.id_generator import CURATOR_DEDUP_ID_STR diff --git a/src/nemotron/recipes/data/curation/nemotron-cc/step_2c-substring_dedup/README.md b/src/nemotron/recipes/data/curation/nemotron-cc/step_2c-substring_dedup/README.md index 82945b43c..7fe68072a 100644 --- a/src/nemotron/recipes/data/curation/nemotron-cc/step_2c-substring_dedup/README.md +++ b/src/nemotron/recipes/data/curation/nemotron-cc/step_2c-substring_dedup/README.md @@ -13,7 +13,7 @@ If you are adapting it for Slurm: 3. The `remove_duplicates` step must run on a single exclusive node. Dependencies: -- `nemo_curator>=1.1.0` +- `nemo_curator>=1.2.0` (26.04 release) - `cargo` (for building `deduplicate-text-datasets`). See https://doc.rust-lang.org/cargo/getting-started/installation.html We recommend splitting up the dataset into 100 GB chunks or less, and executing `exact_substring_dedup.sh` on each 100 GB chunk. diff --git a/src/nemotron/recipes/data/curation/nemotron-cc/step_2c-substring_dedup/prepare_dataset.py b/src/nemotron/recipes/data/curation/nemotron-cc/step_2c-substring_dedup/prepare_dataset.py index 24515dcf2..d4c2d0b61 100644 --- a/src/nemotron/recipes/data/curation/nemotron-cc/step_2c-substring_dedup/prepare_dataset.py +++ b/src/nemotron/recipes/data/curation/nemotron-cc/step_2c-substring_dedup/prepare_dataset.py @@ -21,7 +21,7 @@ import tiktoken from nemo_curator.backends.base import WorkerMetadata -from nemo_curator.backends.experimental.ray_data import RayDataExecutor +from nemo_curator.backends.ray_data import RayDataExecutor from nemo_curator.core.client import RayClient from nemo_curator.pipeline import Pipeline from nemo_curator.stages.base import ProcessingStage @@ -91,6 +91,7 @@ def write_id_to_filename(input_file_paths: list[str], output_path: str) -> dict[ input_file_list = [os.path.basename(filename) for filename in input_file_paths] id_to_filename = {str(i): filename for i, filename in enumerate(input_file_list)} + os.makedirs(output_path, exist_ok=True) with open(f"{output_path}/id_to_filename.json", "w") as fp: json.dump(id_to_filename, fp) diff --git a/src/nemotron/recipes/data/curation/nemotron-cc/step_3-quality_classification.py b/src/nemotron/recipes/data/curation/nemotron-cc/step_3-quality_classification.py index 32e6a2151..6efa1e798 100644 --- a/src/nemotron/recipes/data/curation/nemotron-cc/step_3-quality_classification.py +++ b/src/nemotron/recipes/data/curation/nemotron-cc/step_3-quality_classification.py @@ -41,7 +41,7 @@ import pandas as pd from loguru import logger -from nemo_curator.backends.experimental.ray_data import RayDataExecutor +from nemo_curator.backends.ray_data import RayDataExecutor from nemo_curator.core.client import RayClient from nemo_curator.pipeline import Pipeline from nemo_curator.stages.base import ProcessingStage diff --git a/src/nemotron/recipes/data/curation/nemotron-cc/step_4-sdg.py b/src/nemotron/recipes/data/curation/nemotron-cc/step_4-sdg.py index 85eea7e42..939fc46ac 100644 --- a/src/nemotron/recipes/data/curation/nemotron-cc/step_4-sdg.py +++ b/src/nemotron/recipes/data/curation/nemotron-cc/step_4-sdg.py @@ -30,25 +30,45 @@ to an LLM for generation, postprocesses the results, and writes output to parquet or JSONL. -This is a CPU-only pipeline — LLM inference happens via API calls to an -external endpoint (NVIDIA Integrate, or a self-hosted vLLM/TRT-LLM server). +LLM backends (pick one): + + 1. Local inference server (default). + The script spins up a Ray Serve + vLLM deployment of --model-name on + the local cluster and routes generation through it. No API key + required, and it scales out across replicas. Pass --no-serve-model + to opt out and use one of the external options below. + + 2. Existing OpenAI-compatible endpoint. + Pass --no-serve-model --base-url for a self-hosted vLLM/TRT-LLM/ + NIM server (or any OpenAI-compatible cloud provider). --api-key is + forwarded if set. + + 3. NVIDIA Build (build.nvidia.com). + Pass --no-serve-model to use the default --base-url. Requires + --api-key (or NVIDIA_API_KEY env var). Note: the default --model-name + (Qwen3-30B-A3B-Instruct-2507) is not hosted on NVIDIA Build — pass + --model-name to a model that is. Usage: - # Run all SDG tasks on bucket 18/19 data + # Default: stand up a local inference server (4 GPUs, 4 replicas). + # Bump --max-concurrent-requests if GPU utilization is low. python step_4-sdg.py \\ --task all \\ - --tokenizer Qwen/Qwen3-30B-A3B-Instruct-2507 + --tensor-parallel-size 1 - # Run a single task + # Hit an existing self-hosted OpenAI-compatible endpoint python step_4-sdg.py \\ - --task diverse_qa \\ + --task all \\ + --no-serve-model \\ + --base-url http://localhost:8000/v1 \\ --tokenizer Qwen/Qwen3-30B-A3B-Instruct-2507 - # Use a local/self-hosted endpoint + # Hit NVIDIA Build (set NVIDIA_API_KEY in env or pass --api-key) python step_4-sdg.py \\ - --task distill \\ - --base-url http://localhost:8000/v1 \\ - --tokenizer Qwen/Qwen3-30B-A3B-Instruct-2507 + --task all \\ + --no-serve-model \\ + --model-name meta/llama-3.3-70b-instruct \\ + --tokenizer meta-llama/Llama-3.3-70B-Instruct See README.md in this directory for detailed usage instructions. """ @@ -61,7 +81,7 @@ from loguru import logger from transformers import AutoTokenizer -from nemo_curator.backends.experimental.ray_data import RayDataExecutor +from nemo_curator.backends.ray_data import RayDataExecutor from nemo_curator.backends.xenna import XennaExecutor from nemo_curator.core.client import RayClient from nemo_curator.models.client.llm_client import GenerationConfig @@ -524,79 +544,140 @@ def run_task( return metrics -def main(args: argparse.Namespace) -> None: - ray_client = RayClient(num_cpus=args.num_cpus, include_dashboard=False) - ray_client.start() - - tokenizer_name = args.tokenizer if args.tokenizer else args.model_name - tokenizer = AutoTokenizer.from_pretrained(tokenizer_name) - args.hf_token = os.environ.get("HF_TOKEN", "") - - api_key = args.api_key - if not api_key: - msg = ( - "API key is required. Set NVIDIA_API_KEY environment variable or use --api-key. " - "Get your API key from https://build.nvidia.com/settings/api-keys" - ) - raise ValueError(msg) +def _start_inference_server(args: argparse.Namespace): + """Start a local Ray Serve inference server for the model. - llm_client = AsyncOpenAIClient( - api_key=api_key, - base_url=args.base_url, - max_concurrent_requests=args.max_concurrent_requests, - max_retries=args.max_retries, - base_delay=args.base_delay, - timeout=args.timeout, + Returns the InferenceServer instance. + """ + from nemo_curator.backends.utils import get_available_cpu_gpu_resources + from nemo_curator.core.serve import InferenceModelConfig, InferenceServer + + _, num_gpus = get_available_cpu_gpu_resources() + num_gpus = int(num_gpus) + + tp_size = args.tensor_parallel_size if args.tensor_parallel_size is not None else num_gpus + default_replicas = max(num_gpus // tp_size, 1) + min_replicas = args.min_replicas if args.min_replicas is not None else default_replicas + max_replicas = args.max_replicas if args.max_replicas is not None else default_replicas + logger.info( + f"Starting local inference server with tensor_parallel_size={tp_size}, " + f"min_replicas={min_replicas}, max_replicas={max_replicas}" ) - generation_config = GenerationConfig( - temperature=args.temperature if args.temperature is not None else GENERATION_DEFAULTS["temperature"], - top_p=args.top_p if args.top_p is not None else GENERATION_DEFAULTS["top_p"], - top_k=args.top_k if args.top_k is not None else GENERATION_DEFAULTS["top_k"], - max_tokens=args.max_tokens if args.max_tokens is not None else GENERATION_DEFAULTS["max_output_tokens"], - stop=args.end_strings if args.end_strings is not None else GENERATION_DEFAULTS["end_strings"], - seed=args.seed, + server_config = InferenceModelConfig( + model_identifier=args.model_name, + deployment_config={ + "autoscaling_config": { + "min_replicas": min_replicas, + "max_replicas": max_replicas, + }, + }, + engine_kwargs={ + "tensor_parallel_size": tp_size, + "max_model_len": args.max_model_len, + "gpu_memory_utilization": args.gpu_memory_utilization, + }, ) - if args.task == "all": - tasks = list(TASK_CONFIGS.keys()) - else: - tasks = [args.task] - - os.makedirs(args.output_dir, exist_ok=True) - - logger.info("Nemotron-CC SDG Pipeline") - logger.info(f" Input: {args.input_dir}") - logger.info(f" Output: {args.output_dir}") - logger.info(f" Buckets: {HIGH_QUALITY_BUCKETS}") - logger.info(f" Tasks: {tasks}") - logger.info(f" Model: {args.model_name}") - logger.info(f" Tokenizer: {tokenizer_name}") - - all_metrics = {} - total_start = time.perf_counter() - - for task_name in tasks: - task_metrics = run_task( - task_name=task_name, - args=args, - llm_client=llm_client, - generation_config=generation_config, - tokenizer=tokenizer, + server = InferenceServer(models=[server_config]) + server.start() + logger.info(f"Local inference server ready at {server.endpoint}") + return server + + +def main(args: argparse.Namespace) -> None: + inference_server = None + + try: + tokenizer_name = args.tokenizer if args.tokenizer else args.model_name + tokenizer = AutoTokenizer.from_pretrained(tokenizer_name) + args.hf_token = os.environ.get("HF_TOKEN", "") + + # Start the Ray cluster FIRST — InferenceServer requires an existing + # cluster to deploy Serve actors onto. + ray_client = RayClient(num_cpus=args.num_cpus, include_dashboard=False) + ray_client.start() + + if args.serve_model: + import ray + if not ray.is_initialized(): + ray.init(address="auto", ignore_reinit_error=True) + inference_server = _start_inference_server(args) + base_url = inference_server.endpoint + api_key = "unused" + else: + base_url = args.base_url + api_key = args.api_key + if not api_key: + msg = ( + "API key is required. Set NVIDIA_API_KEY environment variable or use --api-key. " + "Get your API key from https://build.nvidia.com/settings/api-keys" + ) + raise ValueError(msg) + + llm_client = AsyncOpenAIClient( + api_key=api_key, + base_url=base_url, + max_concurrent_requests=args.max_concurrent_requests, + max_retries=args.max_retries, + base_delay=args.base_delay, + timeout=args.timeout, ) - all_metrics[task_name] = task_metrics - _save_metrics( - task_metrics, - os.path.join(args.output_dir, f"{task_name}_metrics.json"), + + generation_config = GenerationConfig( + temperature=args.temperature if args.temperature is not None else GENERATION_DEFAULTS["temperature"], + top_p=args.top_p if args.top_p is not None else GENERATION_DEFAULTS["top_p"], + top_k=args.top_k if args.top_k is not None else GENERATION_DEFAULTS["top_k"], + max_tokens=args.max_tokens if args.max_tokens is not None else GENERATION_DEFAULTS["max_output_tokens"], + stop=args.end_strings if args.end_strings is not None else GENERATION_DEFAULTS["end_strings"], + seed=args.seed, ) - total_elapsed = time.perf_counter() - total_start - all_metrics["total_elapsed_s"] = round(total_elapsed, 2) - _save_metrics(all_metrics, os.path.join(args.output_dir, "sdg_metrics.json")) + if args.task == "all": + tasks = list(TASK_CONFIGS.keys()) + else: + tasks = [args.task] + + os.makedirs(args.output_dir, exist_ok=True) + + logger.info("Nemotron-CC SDG Pipeline") + logger.info(f" Input: {args.input_dir}") + logger.info(f" Output: {args.output_dir}") + logger.info(f" Buckets: {HIGH_QUALITY_BUCKETS}") + logger.info(f" Tasks: {tasks}") + logger.info(f" Model: {args.model_name}") + logger.info(f" Tokenizer: {tokenizer_name}") + logger.info(f" Endpoint: {base_url}") + if args.serve_model: + logger.info(" Mode: local inference server") + + all_metrics = {} + total_start = time.perf_counter() + + for task_name in tasks: + task_metrics = run_task( + task_name=task_name, + args=args, + llm_client=llm_client, + generation_config=generation_config, + tokenizer=tokenizer, + ) + all_metrics[task_name] = task_metrics + _save_metrics( + task_metrics, + os.path.join(args.output_dir, f"{task_name}_metrics.json"), + ) + + total_elapsed = time.perf_counter() - total_start + all_metrics["total_elapsed_s"] = round(total_elapsed, 2) + _save_metrics(all_metrics, os.path.join(args.output_dir, "sdg_metrics.json")) - logger.info(f"All SDG tasks completed in {total_elapsed:.1f}s ({total_elapsed / 60:.1f}m)") + logger.info(f"All SDG tasks completed in {total_elapsed:.1f}s ({total_elapsed / 60:.1f}m)") - ray_client.stop() + finally: + if inference_server is not None: + inference_server.stop() + ray_client.stop() def attach_args() -> argparse.ArgumentParser: @@ -651,6 +732,62 @@ def attach_args() -> argparse.ArgumentParser: help="Number of CPUs for the local Ray cluster (default: all available).", ) + parser.add_argument( + "--serve-model", + action=argparse.BooleanOptionalAction, + default=True, + help=( + "Start a local Ray Serve inference server for the model instead of " + "calling an external API (default: True). Requires GPUs and the model " + "weights to be accessible (downloaded automatically from HuggingFace). " + "When enabled, --base-url and --api-key are ignored. Pass " + "--no-serve-model to disable and use --base-url instead." + ), + ) + parser.add_argument( + "--tensor-parallel-size", + type=int, + default=None, + help=( + "Number of GPUs for tensor parallelism when using --serve-model. " + "Defaults to the number of available GPUs." + ), + ) + parser.add_argument( + "--min-replicas", + type=int, + default=None, + help=( + "Minimum number of model replicas when using --serve-model. " + "Defaults to num_gpus // tensor_parallel_size." + ), + ) + parser.add_argument( + "--max-replicas", + type=int, + default=None, + help=( + "Maximum number of model replicas when using --serve-model. " + "Defaults to num_gpus // tensor_parallel_size." + ), + ) + parser.add_argument( + "--max-model-len", + type=int, + default=8192, + help=( + "Maximum sequence length for the vLLM engine when using --serve-model. " + "Reduces KV cache memory. The pipeline needs at most ~4K tokens, so the " + "default of 8192 is sufficient. Set higher only if needed." + ), + ) + parser.add_argument( + "--gpu-memory-utilization", + type=float, + default=0.9, + help="Fraction of GPU memory for vLLM when using --serve-model (0.0-1.0).", + ) + parser.add_argument( "--api-key", type=str, @@ -666,8 +803,12 @@ def attach_args() -> argparse.ArgumentParser: parser.add_argument( "--max-concurrent-requests", type=int, - default=3, - help="Maximum number of concurrent API requests.", + default=32, + help=( + "Maximum number of concurrent API requests. Increase if GPU " + "utilization on the inference server is low (e.g. 256-512 for " + "a local --serve-model deployment with multiple replicas)." + ), ) parser.add_argument( "--max-retries", From e1b4f11b63e97ee0d101fd50a09cf3ce6be42480 Mon Sep 17 00:00:00 2001 From: Ayush Dattagupta Date: Mon, 18 May 2026 15:47:30 +0000 Subject: [PATCH 2/2] Update readme instructions Signed-off-by: Ayush Dattagupta --- .../data/curation/nemotron-cc/README.md | 102 +++++++++++------- .../data/curation/nemotron-cc/step_4-sdg.py | 8 ++ 2 files changed, 71 insertions(+), 39 deletions(-) diff --git a/src/nemotron/recipes/data/curation/nemotron-cc/README.md b/src/nemotron/recipes/data/curation/nemotron-cc/README.md index 64a8c49f5..ec518bd1a 100644 --- a/src/nemotron/recipes/data/curation/nemotron-cc/README.md +++ b/src/nemotron/recipes/data/curation/nemotron-cc/README.md @@ -11,63 +11,82 @@ This directory contains the recipe for curating datasets similar to the [Nemotro ### Pipeline Overview -#### Step 1: Download, Extract, and Clean (`step_1-download_extract.py`) +| # | Script | Compute | Output | +|---|--------|---------|--------| +| 1 | [`step_1-download_extract.py`](#step-1-download-extract-and-clean) | CPU | `data/cleaned_extracted/` | +| 2a | [`step_2a-exact_dedup.py`](#step-2a-exact-deduplication) | GPU + CPU | `data/exact_deduplicated/` | +| 2b | [`step_2b-fuzzy_dedup.py`](#step-2b-fuzzy-deduplication) | GPU + CPU | `data/fuzzy_deduplicated/` | +| 2c | [`step_2c-substring_dedup/`](#step-2c-substring-deduplication) | CPU | `data/substring_deduped/` | +| 3 | [`step_3-quality_classification.py`](#step-3-quality-classification) | GPU + CPU | `data/quality_labeling/bucketed_results/` | +| 4 | [`step_4-sdg.py`](#step-4-synthetic-data-generation) | GPU or external API | `data/sdg_output/` | + +--- + +#### Step 1: Download, Extract, and Clean A CPU-only pipeline that produces clean text from raw web data: - Downloads Common Crawl snapshots (WARC files) and extracts text using JusText. - Annotates each document with a language using a FastText language identification model. - Fixes mojibake (encoding issues) via Unicode reformatting. -- **Output:** `data/cleaned_extracted/` -- **Resources:** CPU-only. We recommend each worker has at least 2GB of RAM to prevent OOM errors. -#### Step 2a: Exact Deduplication (`step_2a-exact_dedup.py`) +**Resources:** CPU-only. Recommend at least 2GB RAM per worker to prevent OOM. + +--- + +#### Step 2a: Exact Deduplication + +Exact deduplication via document hashing. Run `--identify` then `--remove`. + +| Phase | Compute | Scale tested / notes | +|-------|---------|----------------------| +| `--identify` | GPU | 8× H100 for a single snapshot (~4-10TB). ~128× 80GB GPUs recommended for full Common Crawl. | +| `--remove` | CPU, ≥6GB RAM/worker | Reads cached duplicate IDs and filters the original dataset. | -Exact deduplication using document hashing: +--- -- **Phase 1 (`--identify`):** Hashes every document and identifies exact duplicates. - - **Resources:** Requires GPU(s) for accelerated hashing. For a single snapshot (~4-10TB) extracted we tested with 8 H100 GPUs. For all of Common Crawl we recommend ~128 GPUs with 80GB VRAM per GPU. -- **Phase 2 (`--remove`):** Removes duplicate documents, keeping one copy. - - **Resources:** CPU-only. Reads duplicate IDs from the cache directory and filters the original dataset. We recommend each worker has at-least 6GB of RAM to prevent OOM errors. -- **Output:** `data/exact_deduplicated/`. +#### Step 2b: Fuzzy Deduplication -#### Step 2b: Fuzzy Deduplication (`step_2b-fuzzy_dedup.py`) +Fuzzy deduplication using MinHash + LSH. Run `--identify` then `--remove`. -Fuzzy deduplication using MinHash + LSH: +| Phase | Compute | Scale tested / notes | +|-------|---------|----------------------| +| `--identify` | GPU | 8× H100 for a single snapshot (~1-8TB exact-deduped). | +| `--remove` | CPU, ≥6GB RAM/worker | Filters using connected-components results. | -- **Phase 1 (`--identify`):** Identify near duplicate docs using MinHash-LSH based duplicate identification. - - **Resources:** Requires GPU(s). For a single snapshot (~1-8TB) exact deduplicated we tested with 8 H100 GPUs. -- **Phase 2 (`--remove`):** Removes fuzzy duplicates based on connected components. - - **Resources:** CPU-only. Reads duplicate IDs and filters the original dataset. We recommend each worker has at-least 6GB of RAM to prevent OOM errors. -- **Output:** `data/fuzzy_deduplicated/`. +--- -#### Step 2c: Substring Deduplication (`step_2c-substring_dedup/`) +#### Step 2c: Substring Deduplication CPU-only exact substring deduplication using [Google Research's deduplicate-text-datasets](https://github.com/google-research/deduplicate-text-datasets) ([paper](https://arxiv.org/abs/2107.06499)). Removes duplicate substrings within and across documents using suffix arrays. -- **Resources:** CPU-only. Requires 2-3x the input dataset size in RAM and 10-15x in disk space. We recommend splitting data into 100GB chunks. -- **Output:** `data/substring_deduped/`. +**Resources:** CPU-only. Requires 2-3× the input dataset size in RAM and 10-15× on disk. Recommend splitting data into 100GB chunks. See the [step_2c README](./step_2c-substring_dedup/README.md) for detailed instructions and debugging tips. -#### Step 3: Quality Classification (`step_3-quality_classification.py`) +--- -Ensemble quality scoring and bucketing into 20 quality tiers: +#### Step 3: Quality Classification -- **Phase 1 (`--classify`):** Filters to English, then runs three quality classifiers in parallel: - - [FineWebNemotronEduClassifier](https://huggingface.co/nvidia/nemocurator-fineweb-nemotron-4-edu-classifier) - - [FineWebMixtralEduClassifier](https://huggingface.co/nvidia/nemocurator-fineweb-mixtral-edu-classifier) - - [FastText quality filter (`fasttext-oh-eli5`)](https://huggingface.co/mlfoundations/fasttext-oh-eli5) - - **Resources:** Requires GPU(s) for the neural classifiers. For a single snapshot we tested with 64 H100 GPUs. This scale is embarrassingly parallel so use fewer/more GPUs as needed with at least 80GB VRAM per GPU. -- **Phase 2 (`--ensemble`):** Computes token-weighted percentile thresholds from sampled classification scores, maps float scores to integer bins (0-19), takes the per-document max across classifiers as the ensemble score. - - **Resources:** CPU-only. Reads classification results and computes thresholds and bucketing. Tested with max `fraction=0.1` on a machine with 200GB ram. For OOM errors would recommend reducing the sampling fraction. -- **Output:** `data/quality_labeling/bucketed_results/ensemble-max-int={0-19}/` partitioned by quality bucket (0 = lowest, 19 = highest). +Ensemble quality scoring and bucketing into 20 quality tiers. Run `--classify` then `--ensemble`. -#### Step 4: Synthetic Data Generation (`step_4-sdg.py`) +| Phase | Compute | Scale tested / notes | +|-------|---------|----------------------| +| `--classify` | GPU, ≥80GB VRAM | Filters to English, runs three classifiers in parallel. Tested at 64× H100 per snapshot; embarrassingly parallel — scale up/down freely. | +| `--ensemble` | CPU | Computes token-weighted percentile thresholds and per-document max across classifiers. Tested at `fraction=0.1` on 200GB RAM; reduce sampling fraction if OOM. | -LLM-based synthetic data generation on the highest-quality documents (buckets 18 and 19). +Classifiers used: +- [FineWebNemotronEduClassifier](https://huggingface.co/nvidia/nemocurator-fineweb-nemotron-4-edu-classifier) +- [FineWebMixtralEduClassifier](https://huggingface.co/nvidia/nemocurator-fineweb-mixtral-edu-classifier) +- [FastText quality filter (`fasttext-oh-eli5`)](https://huggingface.co/mlfoundations/fasttext-oh-eli5) -Four generation tasks: +**Output layout:** `data/quality_labeling/bucketed_results/ensemble-max-int={0-19}/` partitioned by bucket (0 = lowest, 19 = highest). + +--- + +#### Step 4: Synthetic Data Generation + +LLM-based synthetic data generation on the highest-quality documents (buckets 18 and 19). Four generation tasks: | Task | Description | Max Input / Output Tokens | |------|-------------|---------------------------| @@ -76,14 +95,19 @@ Four generation tasks: | `extract_knowledge` | Rewrites text as textbook/Wikipedia-style passages focused on factual content | 1400 / 1400 | | `knowledge_list` | Extracts organized bulleted lists of key facts, concepts, and statistics | 1000 / 600 | -Each task runs as an independent pipeline (preprocessing, LLM generation, postprocessing, write). When `--task all` is used, the four tasks run sequentially. They can also be run as separate processes in parallel. +Each task is an independent pipeline (preprocess → LLM generate → postprocess → write). `--task all` runs the four sequentially; they can also be launched as parallel processes. + +**LLM backends** — pick one: + +| Backend | How to select | Notes | +|---------|---------------|-------| +| Local inference server (default) | (default) | Spins up a Ray Serve + vLLM deployment of `--model-name` on the local GPU cluster. No API key. Tune with `--tensor-parallel-size`, `--min-replicas`, `--max-replicas`; bump `--max-concurrent-requests` (try 256–512) if GPU utilization is low. | +| Existing OpenAI-compatible endpoint | `--no-serve-model --base-url ` | Self-hosted vLLM/TRT-LLM/NIM or any OpenAI-compatible cloud provider. `--api-key` forwarded if set. | +| [NVIDIA Build](https://build.nvidia.com/) | `--no-serve-model` | Uses the default `--base-url`. Requires `--api-key` (or `NVIDIA_API_KEY`). Default `--model-name` is not on NVIDIA Build — set `--model-name` (and `--tokenizer`) to a model that is. | -**LLM backends.** The script supports three ways to reach the model — pick one: +> **Note on `--tokenizer`:** The tokenizer is loaded via Hugging Face `AutoTokenizer`, so `--tokenizer` must be a Hugging Face repo id (or local path to HF tokenizer files), regardless of which backend you pick. If `--tokenizer` is not set, it defaults to `--model-name`, which in some cases is not a valid HF tokenizer path — for example `--model-name meta/llama-3.3-70b-instruct` needs `--tokenizer meta-llama/Llama-3.3-70B-Instruct` set explicitly. -1. **Local inference server (default).** The script spins up a Ray Serve + vLLM deployment of `--model-name` on the local GPU cluster. No API key needed; scales out across replicas. Tune with `--tensor-parallel-size`, `--min-replicas`, `--max-replicas`. If GPU utilization is low, increase `--max-concurrent-requests` (try 256–512). Pass `--no-serve-model` to use one of the external options below instead. -2. **Existing OpenAI-compatible endpoint.** Pass `--no-serve-model --base-url ` to use a self-hosted vLLM/TRT-LLM/NIM server (or any OpenAI-compatible cloud provider). `--api-key` is forwarded if needed. -3. **[NVIDIA Build](https://build.nvidia.com/).** Pass `--no-serve-model` to use the default `--base-url`. Requires `--api-key` (or `NVIDIA_API_KEY` env var). The default `--model-name` (`Qwen/Qwen3-30B-A3B-Instruct-2507`) is not on NVIDIA Build, so set `--model-name` (and `--tokenizer`) to a model that is. +**Defaults:** - **Default model:** [`Qwen/Qwen3-30B-A3B-Instruct-2507`](https://huggingface.co/Qwen/Qwen3-30B-A3B-Instruct-2507). -- **Output:** `data/sdg_output//`. - **Resources:** With `--serve-model`, GPU(s) for vLLM. Otherwise CPU-only; just needs network access to the chosen endpoint. diff --git a/src/nemotron/recipes/data/curation/nemotron-cc/step_4-sdg.py b/src/nemotron/recipes/data/curation/nemotron-cc/step_4-sdg.py index 939fc46ac..0aac83eaa 100644 --- a/src/nemotron/recipes/data/curation/nemotron-cc/step_4-sdg.py +++ b/src/nemotron/recipes/data/curation/nemotron-cc/step_4-sdg.py @@ -49,6 +49,14 @@ (Qwen3-30B-A3B-Instruct-2507) is not hosted on NVIDIA Build — pass --model-name to a model that is. +Note on --tokenizer: + The tokenizer is loaded via Hugging Face AutoTokenizer, so --tokenizer + must be a Hugging Face repo id (or local path to HF tokenizer files). + If --tokenizer is not set, it defaults to --model-name, which in some + cases is not a valid HF tokenizer path — e.g. --model-name + meta/llama-3.3-70b-instruct needs --tokenizer + meta-llama/Llama-3.3-70B-Instruct set explicitly. + Usage: # Default: stand up a local inference server (4 GPUs, 4 replicas). # Bump --max-concurrent-requests if GPU utilization is low.