ContinuousBenchCuration

End-to-end curation pipelines for the ContinuousBench benchmark family. Each pipeline ingests raw source data, runs deduplication and clustering, generates question-answer pairs through a multi-stage LLM workflow, and produces eval-ready corpus slices and QA splits that can be pushed to HuggingFace as proper datasets with config + revision tags.

Two pipelines live here:

geminon_curation/: synthetic dataset of 600 fictional Pokémon-like creatures with stats, names, ~1.5M LLM-generated corpus articles, and 8,400 factual QA pairs. Source data is fully synthetic, so every stage after naming and corpus generation runs offline with no LLM calls.
news_curation/: news QA dataset built from a Common Crawl News dump. Pulls WARCs, extracts articles via trafilatura, dedupes, clusters events with windowed kNN + Leiden, then runs a 5-stage LLM workflow (fact extraction → QA generation → zero-shot eval → judge → support check + open-book) to produce factual QAs grounded in real news articles.

Both pipelines share the same design principles and were intentionally written in parallel:

Numbered stage scripts. Each stage reads its input, writes its output, and can be re-run independently. No hidden state, no orchestration framework.
Decoupled LLM calls. Every script either generates prompts (saved as JSONL) or processes responses. The actual API calls happen in tools/query_gemini.py, which can be replaced with any batch system. This means you can stop after a save-prompts stage, send the prompts through your own infrastructure, and resume.
Versioned outputs. Each pipeline writes to output/{version}/.... Bumping version in config.yaml creates a fresh dataset version without touching the old one.
Config-driven. All version-specific knobs (seeds, dedup thresholds, model choices, sample sizes, prompt templates) live in config.yaml. No hardcoded paths or magic numbers in the stage scripts.
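As an illustration of the decoupled and versioned conventions above, here is a minimal sketch of a save-prompts stage and a process-responses stage. The helper names, the "id" / "prompt" / "response" record fields, and the stage directory names are assumptions made for this example; the actual stage scripts and tools/ helpers may be organized differently.

```python
import json
from pathlib import Path


def stage_dir(root: Path, version: str, stage: str) -> Path:
    """Return root/output/{version}/{stage}, creating it if needed."""
    path = root / "output" / version / stage
    path.mkdir(parents=True, exist_ok=True)
    return path


def write_jsonl(path: Path, records) -> None:
    """Write one JSON object per line."""
    with path.open("w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")


def read_jsonl(path: Path) -> list:
    """Read a JSONL file into a list of dicts, skipping blank lines."""
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def save_prompts(items, template: str, out_path: Path) -> None:
    """Save-prompts stage: render one prompt per item. No API call is made here."""
    write_jsonl(
        out_path,
        ({"id": it["id"], "prompt": template.format(**it)} for it in items),
    )


def process_responses(responses_path: Path) -> dict:
    """Process-responses stage: map each id to its response text (hypothetical schema)."""
    return {r["id"]: r["response"] for r in read_jsonl(responses_path)}
```

Because the two stages only communicate through files, anything that turns the prompts JSONL into a responses JSONL with matching ids can sit between them.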

Repo layout

ContinuousBenchCuration/
├── README.md                ← you are here
├── requirements.txt         ← combined deps for both pipelines + tools/
├── tools/                   ← shared utilities (io, dedup, sampling, split,
│                              query_gemini, push_to_hf, meta_data)
├── geminon_curation/        ← Geminon pipeline — 8 numbered stages,
│                              templates/, committed reference Pokémon CSVs
└── news_curation/           ← News pipeline — 19 numbered stages, templates/

For per-pipeline quick starts, stage tables, and output schemas, see geminon_curation/README.md and news_curation/README.md.

Setup

pip install -r requirements.txt

Both pipelines run in the same environment. The heavy GPU stages (news embeddings + clustering) need additional dependencies (sentence-transformers, torch, python-igraph, leidenalg); see news_curation/README.md for the full list.

For LLM querying you'll need Gemini API keys. The tools/query_gemini.py utility supports key rotation:

python -m tools.query_gemini \
    --input some_prompts.jsonl \
    --output some_responses.jsonl \
    --api-keys $KEY1,$KEY2,$KEY3 \
    --model gemini-2.5-pro \
    --max-workers 32 --resume
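This README does not describe how key rotation and --resume work internally. One plausible reading, sketched below under that assumption, is round-robin assignment of keys to requests plus skipping any prompt whose id already appears in the output file. The function names and the "id" field are hypothetical and are not taken from tools/query_gemini.py.

```python
import itertools
import json
from pathlib import Path


def completed_ids(output_path: Path) -> set:
    """Ids already present in the output file (empty set if the file is missing)."""
    if not output_path.exists():
        return set()
    with output_path.open(encoding="utf-8") as f:
        return {json.loads(line)["id"] for line in f if line.strip()}


def assign_keys(prompts, api_keys, done=frozenset()):
    """Yield (api_key, prompt) pairs for unfinished prompts, cycling keys round-robin."""
    keys = itertools.cycle(api_keys)
    for prompt in prompts:
        if prompt["id"] not in done:  # resume: skip work that is already saved
            yield next(keys), prompt
```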

For HuggingFace pushes you'll need an HF token in HF_TOKEN.

Output structure

After running a pipeline end-to-end, the output directory has the same eval-ready shape regardless of which pipeline produced it:

output/{version}/
├── corpus/
│   ├── large/
│   │   ├── all.jsonl          ← relative symlink to the source slice file
│   │   ├── train.jsonl        ← 90% (seeded shuffle)
│   │   ├── val.jsonl          ← 5%
│   │   └── test.jsonl         ← 5% (gets the rounding remainder)
│   ├── medium/                ← same 4 files
│   └── small/                 ← same 4 files
├── qa/
│   ├── ...
│   └── (geminon: small/, medium/, named the same way as corpus/;
│        news: final/filtered/ with good_qas.jsonl + val.jsonl + test.jsonl)
└── stats/
    ├── token_counts_{slice}.json
    ├── token_dist_{slice}.png
    ├── token_dist_overlay.png
    └── (per-pipeline summaries)

The corpus slices are uniformly named so downstream loaders don't have to know which pipeline produced the data. Train/val/test sums always match the source line count exactly (verified per release on real data).
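The 90/5/5 split with a seeded shuffle, where test absorbs the rounding remainder, can be sketched as follows. This is a minimal illustration of the property described above (the three parts always sum to the source line count), not the code in tools/split; the function name and the integer-percentage parameters are assumptions.

```python
import random


def split_lines(lines, seed, train_pct=90, val_pct=5):
    """Shuffle with a fixed seed, then cut into train/val/test.

    Train and val sizes are floored; test takes everything left over,
    so len(train) + len(val) + len(test) == len(lines) for any input size.
    """
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)  # local RNG, leaves global state alone
    n = len(shuffled)
    n_train = n * train_pct // 100  # integer math avoids float rounding surprises
    n_val = n * val_pct // 100
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # gets the rounding remainder
    return train, val, test
```

For example, 1,003 source lines give 902 train, 50 val, and 51 test lines, and re-running with the same seed reproduces the same assignment.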

Using the released datasets

The released versions live on HuggingFace at ContinuousBench/Geminon and ContinuousBench/News. Per-config splits and sizes are documented on each dataset card.

from datasets import load_dataset

load_dataset("ContinuousBench/Geminon", "corpus_large", split="train")
load_dataset("ContinuousBench/Geminon", "qa_small",     split="public_val")

load_dataset("ContinuousBench/News",    "corpus_large", split="train")
load_dataset("ContinuousBench/News",    split="val")      # qa is the default config

# Pin a specific release
load_dataset("ContinuousBench/Geminon", "index",
             split="public", revision="2025_09")

Publishing your own version. If you regenerate either dataset and want to host it on HuggingFace, tools/push_to_hf.py renders a YAML-frontmatter dataset card with configs + splits and (optionally) tags the commit with your version label. Pass --repo your-org/your-dataset to override the default upload target. See python -m tools.push_to_hf --help for the full flag set (--public, --skip-tag, --skip-qa, --dry-run, …). tools/meta_data.py does the matching Croissant + RAI merge.
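For orientation, a dataset card's YAML frontmatter with configs and splits has roughly the shape produced by the sketch below. The render_card_frontmatter function and the example paths are hypothetical and are not the implementation in tools/push_to_hf.py; only the configs / config_name / data_files / split / path keys follow the Hugging Face dataset card convention.

```python
def render_card_frontmatter(configs, default=None):
    """Render YAML frontmatter listing each config and its split files.

    configs maps config name -> {split name: path relative to the repo root}.
    default, if given, names the config marked as the default one.
    """
    lines = ["---", "configs:"]
    for name, splits in configs.items():
        lines.append(f"- config_name: {name}")
        if name == default:
            lines.append("  default: true")
        lines.append("  data_files:")
        for split, path in splits.items():
            lines.append(f"  - split: {split}")
            lines.append(f"    path: {path}")
    lines.append("---")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Illustrative paths only; real releases may lay files out differently.
    print(render_card_frontmatter(
        {"corpus_large": {"train": "corpus/large/train.jsonl",
                          "val": "corpus/large/val.jsonl",
                          "test": "corpus/large/test.jsonl"}},
    ))
```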

Documentation

File                         Contents
README.md (this file)        Project overview, repo layout, conventions
geminon_curation/README.md   Geminon pipeline quick start, stages, output schema, config
news_curation/README.md      News pipeline quick start, stages, output schema, config
requirements.txt             Combined Python deps for both pipelines + tools
