End-to-end curation pipelines for the ContinuousBench benchmark family. Each pipeline ingests raw source data, runs deduplication and clustering, generates question-answer pairs through a multi-stage LLM workflow, and produces eval-ready corpus slices and QA splits that can be pushed to HuggingFace as proper datasets with config + revision tags.
Two pipelines live here:
- geminon_curation/ — synthetic dataset of 600 fictional Pokémon-like creatures with stats, names, ~1.5M LLM-generated corpus articles, and 8,400 factual QA pairs. Source data is fully synthetic so the pipeline runs offline (no LLM calls past the naming + corpus generation stages).
- news_curation/ — news QA dataset built from a Common Crawl News dump. Pulls WARCs, extracts articles via trafilatura, dedupes, clusters events with windowed kNN + Leiden, then runs a 5-stage LLM workflow (fact extraction → QA generation → zero-shot eval → judge → support check + open-book) to produce factual QAs grounded in real news articles.
Both pipelines share the same design principles and were intentionally written in parallel:
- Numbered stage scripts. Each stage reads its input, writes its output, and can be re-run independently. No hidden state, no orchestration framework.
- Decoupled LLM calls. Every script either generates prompts (saved as JSONL) or processes responses. The actual API calls happen in tools/query_gemini.py, which can be replaced with any batch system. This means you can stop after a save-prompts stage, send the prompts through your own infrastructure, and resume.
- Versioned outputs. Each pipeline writes to output/{version}/.... Bumping version in config.yaml creates a fresh dataset version without touching the old one.
- Config-driven. All version-specific knobs (seeds, dedup thresholds, model choices, sample sizes, prompt templates) live in config.yaml. No hardcoded paths or magic numbers in the stage scripts.
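As a rough illustration of the decoupled, versioned pattern described above, a save-prompts stage and its matching process-responses stage might look like the sketch below. This is not the repository's code: the function names, the `id` / `prompt` / `response` JSONL fields, and the file names are all hypothetical, and the config is passed in as a plain dict (in practice it would presumably be loaded from config.yaml).

```python
import json
from pathlib import Path


def version_dir(config: dict, root: Path) -> Path:
    """Resolve output/{version}/ from an already-loaded config dict."""
    return root / "output" / str(config["version"])


def save_prompts(records: list[dict], template: str, path: Path) -> int:
    """Render one prompt per record and write them as JSONL; no API call happens here."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        for rec in records:
            # "id" and "prompt" are hypothetical field names for this sketch.
            row = {"id": rec["id"], "prompt": template.format(**rec)}
            f.write(json.dumps(row, ensure_ascii=False) + "\n")
    return len(records)


def load_responses(path: Path) -> dict:
    """Read a responses JSONL back into {id: response} for the next stage."""
    out = {}
    with path.open(encoding="utf-8") as f:
        for line in f:
            if line.strip():
                row = json.loads(line)
                out[row["id"]] = row["response"]
    return out
```

Because the only contract between the two halves is a pair of JSONL files, whatever system produces the responses file can be swapped without touching either stage.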
ContinuousBenchCuration/
├── README.md ← you are here
├── requirements.txt ← combined deps for both pipelines + tools/
├── tools/ ← shared utilities (io, dedup, sampling, split,
│ query_gemini, push_to_hf, meta_data)
├── geminon_curation/ ← Geminon pipeline — 8 numbered stages,
│ templates/, committed reference Pokémon CSVs
└── news_curation/ ← News pipeline — 19 numbered stages, templates/
For per-pipeline quick starts, stage tables, and output schemas, read the per-pipeline READMEs: geminon_curation/README.md and news_curation/README.md.
pip install -r requirements.txt

Both pipelines run on the same env. Heavy GPU stages (news embeddings + clustering) need additional deps (sentence-transformers, torch, python-igraph, leidenalg); see news_curation/README.md for the full list.
For LLM querying you'll need Gemini API keys. The tools/query_gemini.py utility supports key rotation:
python -m tools.query_gemini \
--input some_prompts.jsonl \
--output some_responses.jsonl \
--api-keys $KEY1,$KEY2,$KEY3 \
--model gemini-2.5-pro \
--max-workers 32 --resume

For HuggingFace pushes you'll need an HF token in HF_TOKEN.
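The internals of tools/query_gemini.py are not shown here, so the following is only a sketch of what resuming and key rotation could look like for anyone replacing it with their own batch system. It assumes each prompt and response row carries a hypothetical `id` field, and it uses simple round-robin key assignment; the real utility may work differently.

```python
import itertools
import json
from pathlib import Path


def completed_ids(output_path: Path) -> set:
    """Collect ids already present in the responses file (empty set if it does not exist)."""
    if not output_path.exists():
        return set()
    ids = set()
    with output_path.open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                ids.add(json.loads(line)["id"])  # "id" is an assumed field name
    return ids


def pending_prompts(prompts: list[dict], done: set) -> list[dict]:
    """Keep only prompts whose id has no saved response yet (the resume step)."""
    return [p for p in prompts if p["id"] not in done]


def assign_keys(prompts: list[dict], api_keys: list[str]) -> list[tuple]:
    """Pair each prompt with an API key, cycling through the keys round-robin."""
    keys = itertools.cycle(api_keys)
    return [(p, next(keys)) for p in prompts]
```

A replacement batch runner would call `completed_ids` on the existing output, filter with `pending_prompts`, and append new responses to the same file so that an interrupted run can pick up where it stopped.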
After running a pipeline end-to-end, the output directory has the same eval-ready shape regardless of which pipeline produced it:
output/{version}/
├── corpus/
│ ├── large/
│ │ ├── all.jsonl ← relative symlink to the source slice file
│ │ ├── train.jsonl ← 90% (seeded shuffle)
│ │ ├── val.jsonl ← 5%
│ │ └── test.jsonl ← 5% (gets the rounding remainder)
│ ├── medium/ ← same 4 files
│ └── small/ ← same 4 files
├── qa/
│ ├── ...
│ └── (geminon: small/, medium/ ⎤ the same words as corpus/)
│ (news: final/filtered/ ⎦ with good_qas.jsonl + val.jsonl + test.jsonl)
└── stats/
├── token_counts_{slice}.json
├── token_dist_{slice}.png
├── token_dist_overlay.png
└── (per-pipeline summaries)
The corpus slices are uniformly named so downstream loaders don't have to know which pipeline produced the data. Train/val/test sums always match the source line count exactly (verified per release on real data).
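The 90/5/5 split with a seeded shuffle, where test absorbs the rounding remainder, can be sketched as follows. This is a minimal illustration, not the code in tools/: the function name and the choice to floor the train and val sizes with integer arithmetic are assumptions.

```python
import random


def split_lines(lines: list, seed: int, train_pct: int = 90, val_pct: int = 5):
    """Shuffle deterministically, then cut into train/val/test.

    Train and val sizes are floored; test takes everything left over,
    so the three parts always sum to len(lines).
    """
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)  # local RNG, does not touch global state
    n = len(shuffled)
    n_train = n * train_pct // 100
    n_val = n * val_pct // 100
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test
```

With 103 input lines this yields 92 train, 5 val, and 6 test lines. The same seed always reproduces the same assignment.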
The released versions live on HuggingFace at ContinuousBench/Geminon and ContinuousBench/News. Per-config splits and sizes are documented on each dataset card.
from datasets import load_dataset
load_dataset("ContinuousBench/Geminon", "corpus_large", split="train")
load_dataset("ContinuousBench/Geminon", "qa_small", split="public_val")
load_dataset("ContinuousBench/News", "corpus_large", split="train")
load_dataset("ContinuousBench/News", split="val") # qa is the default config
# Pin a specific release
load_dataset("ContinuousBench/Geminon", "index",
    split="public", revision="2025_09")

Publishing your own version. If you regenerate either dataset and want to host it on HuggingFace, tools/push_to_hf.py renders a YAML-frontmatter dataset card with configs + splits and (optionally) tags the commit with your version label. Pass --repo your-org/your-dataset to override the default upload target. See python -m tools.push_to_hf --help for the full flag set (--public, --skip-tag, --skip-qa, --dry-run, …). tools/meta_data.py does the matching Croissant + RAI merge.
| File | Contents |
|---|---|
| README.md (this file) | Project overview, repo layout, conventions |
| geminon_curation/README.md | Geminon pipeline quick start, stages, output schema, config |
| news_curation/README.md | News pipeline quick start, stages, output schema, config |
| requirements.txt | Combined Python deps for both pipelines + tools |