Merged
6 changes: 3 additions & 3 deletions docs/nemotron/data/curation/nemotron-cc.md
@@ -19,7 +19,7 @@ Common Crawl → Extract & Clean → Deduplicate → Quality Classify → Synthe
| 2b | `step_2b-fuzzy_dedup.py` | MinHash + LSH fuzzy deduplication | GPU (identify), CPU (remove) |
| 2c | `step_2c-substring_dedup/` | Exact substring deduplication using suffix arrays | CPU-only |
| 3 | `step_3-quality_classification.py` | Ensemble quality scoring into 20 buckets | GPU (classify), CPU (ensemble) |
-| 4 | `step_4-sdg.py` | LLM-based synthetic data generation on top-quality data | CPU + LLM endpoint |
+| 4 | `step_4-sdg.py` | LLM-based synthetic data generation on top-quality data | GPU (local inference server, default) or CPU + external LLM endpoint (with `--no-serve-model`) |

Steps 1–3 progressively filter and annotate the data. Step 4 generates synthetic training data (diverse QA, distillation, knowledge extraction, knowledge lists) from the highest-quality documents (buckets 18–19).

@@ -35,9 +35,9 @@ See the recipe README at `src/nemotron/recipes/data/curation/nemotron-cc/README.

## Prerequisites

-- [NeMo Curator](https://github.com/NVIDIA/NeMo-Curator) installed with Ray support
+- [NeMo Curator](https://github.com/NVIDIA/NeMo-Curator) 1.2.0 (26.04 release) or newer, installed with Ray support
 - GPU(s) for steps 2a, 2b, and 3 (deduplication and classification)
-- Access to an OpenAI-compatible LLM endpoint for step 4 (NVIDIA NIM, vLLM, or cloud API)
+- For step 4, one of: GPU(s) to host a local inference server (default), an OpenAI-compatible endpoint (self-hosted vLLM/NIM or cloud, via `--no-serve-model`), or an [NVIDIA Build](https://build.nvidia.com/) API key

## After Curation

108 changes: 69 additions & 39 deletions src/nemotron/recipes/data/curation/nemotron-cc/README.md
@@ -4,70 +4,89 @@ This directory contains the recipe for curating datasets similar to the [Nemotro

### Requirements

- [NeMo Curator](https://github.com/NVIDIA/NeMo-Curator) > 1.1.0 ([install from main](https://docs.nvidia.com/nemo/curator/latest/admin/installation.html))
- [NeMo Curator](https://github.com/NVIDIA/NeMo-Curator) 1.2.0 (26.04 release) or newer ([install instructions](https://docs.nvidia.com/nemo/curator/latest/admin/installation.html))
- GPU(s) for steps 2a, 2b, and 3 (deduplication and classification)
- [Cargo/Rust](https://doc.rust-lang.org/cargo/getting-started/installation.html) for step 2c (building `deduplicate-text-datasets`)
- Access to an OpenAI-compatible LLM endpoint for step 4 (NVIDIA NIM, vLLM, or cloud API)
- For step 4, one of: GPU(s) to host a local inference server (default), an OpenAI-compatible endpoint (self-hosted vLLM/NIM or cloud), or an [NVIDIA Build](https://build.nvidia.com/) API key

### Pipeline Overview

#### Step 1: Download, Extract, and Clean (`step_1-download_extract.py`)
| # | Script | Compute | Output |
|---|--------|---------|--------|
| 1 | [`step_1-download_extract.py`](#step-1-download-extract-and-clean) | CPU | `data/cleaned_extracted/` |
| 2a | [`step_2a-exact_dedup.py`](#step-2a-exact-deduplication) | GPU + CPU | `data/exact_deduplicated/` |
| 2b | [`step_2b-fuzzy_dedup.py`](#step-2b-fuzzy-deduplication) | GPU + CPU | `data/fuzzy_deduplicated/` |
| 2c | [`step_2c-substring_dedup/`](#step-2c-substring-deduplication) | CPU | `data/substring_deduped/` |
| 3 | [`step_3-quality_classification.py`](#step-3-quality-classification) | GPU + CPU | `data/quality_labeling/bucketed_results/` |
| 4 | [`step_4-sdg.py`](#step-4-synthetic-data-generation) | GPU or external API | `data/sdg_output/` |

---

#### Step 1: Download, Extract, and Clean

A CPU-only pipeline that produces clean text from raw web data:

- Downloads Common Crawl snapshots (WARC files) and extracts text using JusText.
- Annotates each document with a language using a FastText language identification model.
- Fixes mojibake (encoding issues) via Unicode reformatting.
- **Output:** `data/cleaned_extracted/`
- **Resources:** CPU-only. We recommend each worker has at least 2GB of RAM to prevent OOM errors.

#### Step 2a: Exact Deduplication (`step_2a-exact_dedup.py`)
**Resources:** CPU-only. Recommend at least 2GB RAM per worker to prevent OOM.

---

#### Step 2a: Exact Deduplication

Exact deduplication using document hashing:
Exact deduplication via document hashing. Run `--identify` then `--remove`.

- **Phase 1 (`--identify`):** Hashes every document and identifies exact duplicates.
- **Resources:** Requires GPU(s) for accelerated hashing. For a single snapshot (~4-10TB) extracted we tested with 8 H100 GPUs. For all of Common Crawl we recommend ~128 GPUs with 80GB VRAM per GPU.
- **Phase 2 (`--remove`):** Removes duplicate documents, keeping one copy.
- **Resources:** CPU-only. Reads duplicate IDs from the cache directory and filters the original dataset. We recommend each worker has at least 6GB of RAM to prevent OOM errors.
- **Output:** `data/exact_deduplicated/`.
| Phase | Compute | Scale tested / notes |
|-------|---------|----------------------|
| `--identify` | GPU | 8× H100 for a single snapshot (~4-10TB). ~128× 80GB GPUs recommended for full Common Crawl. |
| `--remove` | CPU, ≥6GB RAM/worker | Reads cached duplicate IDs and filters the original dataset. |
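
The identify/remove split can be pictured with a minimal CPU-only sketch. It assumes documents are dicts with hypothetical `id` and `text` fields and hashes with SHA-256 from the standard library; the recipe's GPU workflow in NeMo Curator is not reproduced here and may hash and shard differently.

```python
import hashlib


def identify_duplicates(docs):
    """Phase 1 (identify): return IDs of documents whose text was already seen."""
    first_seen = {}        # text hash -> ID of the first document with that text
    duplicate_ids = set()  # the "cache" that phase 2 reads
    for doc in docs:
        digest = hashlib.sha256(doc["text"].encode("utf-8")).hexdigest()
        if digest in first_seen:
            duplicate_ids.add(doc["id"])
        else:
            first_seen[digest] = doc["id"]
    return duplicate_ids


def remove_duplicates(docs, duplicate_ids):
    """Phase 2 (remove): filter the original dataset, keeping one copy per text."""
    return [doc for doc in docs if doc["id"] not in duplicate_ids]


# Illustrative data only.
docs = [
    {"id": "a", "text": "hello world"},
    {"id": "b", "text": "hello world"},
    {"id": "c", "text": "something else"},
]
dup_ids = identify_duplicates(docs)
kept = remove_duplicates(docs, dup_ids)
```

Keeping the two phases separate mirrors the table above: the hashing pass is the expensive, accelerator-friendly part, while the filtering pass only needs the cached duplicate IDs.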

#### Step 2b: Fuzzy Deduplication (`step_2b-fuzzy_dedup.py`)
---

Fuzzy deduplication using MinHash + LSH:
#### Step 2b: Fuzzy Deduplication

- **Phase 1 (`--identify`):** Identify near duplicate docs using MinHash-LSH based duplicate identification.
- **Resources:** Requires GPU(s). For a single snapshot (~1-8TB) exact deduplicated we tested with 8 H100 GPUs.
- **Phase 2 (`--remove`):** Removes fuzzy duplicates based on connected components.
- **Resources:** CPU-only. Reads duplicate IDs and filters the original dataset. We recommend each worker has at least 6GB of RAM to prevent OOM errors.
- **Output:** `data/fuzzy_deduplicated/`.
Fuzzy deduplication using MinHash + LSH. Run `--identify` then `--remove`.

#### Step 2c: Substring Deduplication (`step_2c-substring_dedup/`)
| Phase | Compute | Scale tested / notes |
|-------|---------|----------------------|
| `--identify` | GPU | 8× H100 for a single snapshot (~1-8TB exact-deduped). |
| `--remove` | CPU, ≥6GB RAM/worker | Filters using connected-components results. |
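
As a rough illustration of how MinHash, LSH banding, and connected components fit together, here is a small standard-library sketch. Everything in it is an assumption made for illustration: character 5-gram shingles, salted BLAKE2b as the hash family, 128 hashes in 32 bands, and a union-find for the components. It is not the NeMo Curator implementation and its parameters are not the recipe's.

```python
import hashlib
from collections import defaultdict

NUM_HASHES = 128  # signature length (illustrative value)
NUM_BANDS = 32    # LSH bands, so each band covers 128 / 32 = 4 rows


def shingles(text, n=5):
    """Return the set of character n-grams in the text."""
    return {text[i:i + n] for i in range(max(1, len(text) - n + 1))}


def minhash(text):
    """For each seeded hash function, keep the minimum hash over all shingles."""
    grams = [s.encode("utf-8") for s in shingles(text)]
    signature = []
    for seed in range(NUM_HASHES):
        salt = seed.to_bytes(8, "little")
        signature.append(
            min(hashlib.blake2b(g, digest_size=8, salt=salt).digest() for g in grams)
        )
    return signature


def find(parent, x):
    """Union-find root lookup with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def duplicate_groups(docs):
    """Identify: connect documents whose signatures match in at least one band."""
    rows = NUM_HASHES // NUM_BANDS
    parent = {doc_id: doc_id for doc_id in docs}
    buckets = defaultdict(list)
    for doc_id, text in docs.items():
        sig = minhash(text)
        for band in range(NUM_BANDS):
            key = (band, tuple(sig[band * rows:(band + 1) * rows]))
            buckets[key].append(doc_id)
    for ids in buckets.values():
        for other in ids[1:]:
            parent[find(parent, other)] = find(parent, ids[0])
    groups = defaultdict(set)
    for doc_id in docs:
        groups[find(parent, doc_id)].add(doc_id)
    return [g for g in groups.values() if len(g) > 1]


# Illustrative data only: "a" and "b" differ in the final word.
docs = {
    "a": "The quick brown fox jumps over the lazy dog near the quiet river bank.",
    "b": "The quick brown fox jumps over the lazy dog near the quiet river bend.",
    "c": "Suffix arrays index every substring of a corpus in sorted order.",
}
groups = duplicate_groups(docs)
# Remove: keep one document per connected component, drop the rest.
to_remove = {doc_id for g in groups for doc_id in sorted(g)[1:]}
```

More bands with fewer rows each make the match more permissive; fewer bands with more rows make it stricter. The values above were chosen only so the toy example is easy to follow.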

---

#### Step 2c: Substring Deduplication

CPU-only exact substring deduplication using [Google Research's deduplicate-text-datasets](https://github.com/google-research/deduplicate-text-datasets) ([paper](https://arxiv.org/abs/2107.06499)). Removes duplicate substrings within and across documents using suffix arrays.

- **Resources:** CPU-only. Requires 2-3x the input dataset size in RAM and 10-15x in disk space. We recommend splitting data into 100GB chunks.
- **Output:** `data/substring_deduped/`.
**Resources:** CPU-only. Requires 2-3× the input dataset size in RAM and 10-15× on disk. Recommend splitting data into 100GB chunks.

See the [step_2c README](./step_2c-substring_dedup/README.md) for detailed instructions and debugging tips.

#### Step 3: Quality Classification (`step_3-quality_classification.py`)
---

#### Step 3: Quality Classification

Ensemble quality scoring and bucketing into 20 quality tiers:
Ensemble quality scoring and bucketing into 20 quality tiers. Run `--classify` then `--ensemble`.

- **Phase 1 (`--classify`):** Filters to English, then runs three quality classifiers in parallel:
- [FineWebNemotronEduClassifier](https://huggingface.co/nvidia/nemocurator-fineweb-nemotron-4-edu-classifier)
- [FineWebMixtralEduClassifier](https://huggingface.co/nvidia/nemocurator-fineweb-mixtral-edu-classifier)
- [FastText quality filter (`fasttext-oh-eli5`)](https://huggingface.co/mlfoundations/fasttext-oh-eli5)
- **Resources:** Requires GPU(s) for the neural classifiers. For a single snapshot we tested with 64 H100 GPUs. This scale is embarrassingly parallel so use fewer/more GPUs as needed with at least 80GB VRAM per GPU.
- **Phase 2 (`--ensemble`):** Computes token-weighted percentile thresholds from sampled classification scores, maps float scores to integer bins (0-19), takes the per-document max across classifiers as the ensemble score.
- **Resources:** CPU-only. Reads classification results and computes thresholds and bucketing. Tested with a maximum of `fraction=0.1` on a machine with 200GB of RAM. If you hit OOM errors, reduce the sampling fraction.
- **Output:** `data/quality_labeling/bucketed_results/ensemble-max-int={0-19}/` partitioned by quality bucket (0 = lowest, 19 = highest).
| Phase | Compute | Scale tested / notes |
|-------|---------|----------------------|
| `--classify` | GPU, ≥80GB VRAM | Filters to English, runs three classifiers in parallel. Tested at 64× H100 per snapshot; embarrassingly parallel — scale up/down freely. |
| `--ensemble` | CPU | Computes token-weighted percentile thresholds and per-document max across classifiers. Tested at `fraction=0.1` on 200GB RAM; reduce sampling fraction if OOM. |

#### Step 4: Synthetic Data Generation (`step_4-sdg.py`)
Classifiers used:
- [FineWebNemotronEduClassifier](https://huggingface.co/nvidia/nemocurator-fineweb-nemotron-4-edu-classifier)
- [FineWebMixtralEduClassifier](https://huggingface.co/nvidia/nemocurator-fineweb-mixtral-edu-classifier)
- [FastText quality filter (`fasttext-oh-eli5`)](https://huggingface.co/mlfoundations/fasttext-oh-eli5)

LLM-based synthetic data generation on the highest-quality documents (buckets 18 and 19). This is a CPU-only pipeline: LLM inference happens via API calls to an external endpoint (NVIDIA Integrate, or a self-hosted OpenAI-compatible server).
**Output layout:** `data/quality_labeling/bucketed_results/ensemble-max-int={0-19}/` partitioned by bucket (0 = lowest, 19 = highest).
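
The ensemble phase described above (token-weighted percentile thresholds, integer bins, per-document max) can be sketched roughly as follows. This is one plausible reading, not the script's actual code: the function names, the classifier keys `clf_a` and `clf_b`, and the tie-handling rule are all assumptions.

```python
from bisect import bisect_left

NUM_BUCKETS = 20


def weighted_thresholds(scores, token_counts, num_buckets=NUM_BUCKETS):
    """Score cut points that split the sampled token mass into equal parts."""
    total = sum(token_counts)
    thresholds, cumulative, next_cut = [], 0, 1
    for score, tokens in sorted(zip(scores, token_counts)):
        cumulative += tokens
        # Record this score for every token-mass quantile it crosses.
        while next_cut < num_buckets and cumulative >= total * next_cut / num_buckets:
            thresholds.append(score)
            next_cut += 1
    return thresholds  # num_buckets - 1 non-decreasing values


def to_bucket(score, thresholds):
    """Map a float score to an integer bin: the number of thresholds below it."""
    return bisect_left(thresholds, score)


def ensemble_bucket(doc_scores, thresholds_by_classifier):
    """Ensemble score: the highest bin any classifier assigns to the document."""
    return max(
        to_bucket(doc_scores[name], thresholds)
        for name, thresholds in thresholds_by_classifier.items()
    )


# Illustrative sample: 20 documents with scores 0..19 and equal token counts.
sample_scores = [float(i) for i in range(20)]
sample_tokens = [100] * 20
cuts = weighted_thresholds(sample_scores, sample_tokens)
thresholds_by_classifier = {"clf_a": cuts, "clf_b": cuts}

bucket = ensemble_bucket({"clf_a": 3.0, "clf_b": 17.5}, thresholds_by_classifier)
skewed = weighted_thresholds([1.0, 2.0], [1, 3], num_buckets=4)
```

Weighting by tokens rather than by document count means a few very long documents can shift the cut points, which is why the `skewed` example places two of its three thresholds at the score of the longer document.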

Four generation tasks:
---

#### Step 4: Synthetic Data Generation

LLM-based synthetic data generation on the highest-quality documents (buckets 18 and 19). Four generation tasks:

| Task | Description | Max Input / Output Tokens |
|------|-------------|---------------------------|
@@ -76,8 +95,19 @@ Four generation tasks:
| `extract_knowledge` | Rewrites text as textbook/Wikipedia-style passages focused on factual content | 1400 / 1400 |
| `knowledge_list` | Extracts organized bulleted lists of key facts, concepts, and statistics | 1000 / 600 |

Each task runs as an independent pipeline (preprocessing, LLM generation, postprocessing, write). When `--task all` is used, the four tasks run sequentially. They can also be run as separate processes in parallel.
Each task is an independent pipeline (preprocess → LLM generate → postprocess → write). `--task all` runs the four sequentially; they can also be launched as parallel processes.

**LLM backends** — pick one:

| Backend | How to select | Notes |
|---------|---------------|-------|
| Local inference server (default) | (default) | Spins up a Ray Serve + vLLM deployment of `--model-name` on the local GPU cluster. No API key. Tune with `--tensor-parallel-size`, `--min-replicas`, `--max-replicas`; bump `--max-concurrent-requests` (try 256–512) if GPU utilization is low. |
| Existing OpenAI-compatible endpoint | `--no-serve-model --base-url <url>` | Self-hosted vLLM/TRT-LLM/NIM or any OpenAI-compatible cloud provider. `--api-key` forwarded if set. |
| [NVIDIA Build](https://build.nvidia.com/) | `--no-serve-model` | Uses the default `--base-url`. Requires `--api-key` (or `NVIDIA_API_KEY`). Default `--model-name` is not on NVIDIA Build — set `--model-name` (and `--tokenizer`) to a model that is. |

> **Note on `--tokenizer`:** The tokenizer is loaded via Hugging Face `AutoTokenizer`, so `--tokenizer` must be a Hugging Face repo id (or local path to HF tokenizer files), regardless of which backend you pick. If `--tokenizer` is not set, it defaults to `--model-name`, which in some cases is not a valid HF tokenizer path — for example `--model-name meta/llama-3.3-70b-instruct` needs `--tokenizer meta-llama/Llama-3.3-70B-Instruct` set explicitly.
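
The defaulting rule in the note above amounts to the following minimal sketch. The helper name `resolve_tokenizer_id` is hypothetical and is not taken from `step_4-sdg.py`; only the fallback behaviour and the example model names come from the note.

```python
def resolve_tokenizer_id(model_name, tokenizer=None):
    """Return the Hugging Face repo id (or local path) to load the tokenizer from.

    Falls back to the model name when no tokenizer is given, which only works
    if the model name is itself a valid Hugging Face tokenizer path.
    """
    return tokenizer if tokenizer else model_name


# The default model name is also a Hugging Face repo id, so the fallback works.
default_id = resolve_tokenizer_id("Qwen/Qwen3-30B-A3B-Instruct-2507")

# An endpoint-style model name needs the tokenizer set explicitly.
explicit_id = resolve_tokenizer_id(
    "meta/llama-3.3-70b-instruct",
    tokenizer="meta-llama/Llama-3.3-70B-Instruct",
)

# Loading would then look like this (requires `transformers` and network access):
# from transformers import AutoTokenizer
# tok = AutoTokenizer.from_pretrained(explicit_id)
```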

**Defaults:**

- **Default model:** [`Qwen/Qwen3-30B-A3B-Instruct-2507`](https://huggingface.co/Qwen/Qwen3-30B-A3B-Instruct-2507). This model is not available on [NVIDIA Build](https://build.nvidia.com/), so you'll need to provide a `--base-url` pointing to an endpoint serving it (self-hosted via vLLM/NIM, or any OpenAI-compatible cloud provider). Alternatively, you can use any model available on NVIDIA Build by setting `--model-name` and `--tokenizer` accordingly.
- **Output:** `data/sdg_output/<task_name>/`.
- **Resources:** CPU-only for the script itself. Requires access to an LLM endpoint.
- **Default model:** [`Qwen/Qwen3-30B-A3B-Instruct-2507`](https://huggingface.co/Qwen/Qwen3-30B-A3B-Instruct-2507).
- **Resources:** With `--serve-model`, GPU(s) for vLLM. Otherwise CPU-only; just needs network access to the chosen endpoint.
@@ -29,7 +29,7 @@
from fsspec.core import url_to_fs
from loguru import logger

-from nemo_curator.backends.experimental.ray_data import RayDataExecutor
+from nemo_curator.backends.ray_data import RayDataExecutor
from nemo_curator.core.client import RayClient
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.base import ProcessingStage
@@ -37,7 +37,7 @@

from nemo_curator.stages.file_partitioning import FilePartitioningStage
from nemo_curator.tasks import EmptyTask
-from nemo_curator.backends.experimental.ray_data import RayDataExecutor
+from nemo_curator.backends.ray_data import RayDataExecutor
from nemo_curator.core.client import RayClient
from nemo_curator.stages.deduplication.exact.workflow import ExactDeduplicationWorkflow
from nemo_curator.stages.deduplication.id_generator import CURATOR_DEDUP_ID_STR
@@ -36,7 +36,7 @@

from loguru import logger

-from nemo_curator.backends.experimental.ray_data import RayDataExecutor
+from nemo_curator.backends.ray_data import RayDataExecutor
from nemo_curator.core.client import RayClient
from nemo_curator.stages.deduplication.fuzzy.workflow import FuzzyDeduplicationWorkflow
from nemo_curator.stages.deduplication.id_generator import CURATOR_DEDUP_ID_STR
@@ -13,7 +13,7 @@ If you are adapting it for Slurm:
3. The `remove_duplicates` step must run on a single exclusive node.

Dependencies:
-- `nemo_curator>=1.1.0`
+- `nemo_curator>=1.2.0` (26.04 release)
- `cargo` (for building `deduplicate-text-datasets`). See https://doc.rust-lang.org/cargo/getting-started/installation.html

We recommend splitting up the dataset into 100 GB chunks or less, and executing `exact_substring_dedup.sh` on each 100 GB chunk.
@@ -21,7 +21,7 @@
import tiktoken

from nemo_curator.backends.base import WorkerMetadata
-from nemo_curator.backends.experimental.ray_data import RayDataExecutor
+from nemo_curator.backends.ray_data import RayDataExecutor
from nemo_curator.core.client import RayClient
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.base import ProcessingStage
@@ -91,6 +91,7 @@ def write_id_to_filename(input_file_paths: list[str], output_path: str) -> dict[
     input_file_list = [os.path.basename(filename) for filename in input_file_paths]
     id_to_filename = {str(i): filename for i, filename in enumerate(input_file_list)}

+    os.makedirs(output_path, exist_ok=True)
     with open(f"{output_path}/id_to_filename.json", "w") as fp:
         json.dump(id_to_filename, fp)

@@ -41,7 +41,7 @@
import pandas as pd
from loguru import logger

-from nemo_curator.backends.experimental.ray_data import RayDataExecutor
+from nemo_curator.backends.ray_data import RayDataExecutor
from nemo_curator.core.client import RayClient
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.base import ProcessingStage