IFBench Standalone

Standalone IFBench/IFEval benchmark package for Fireworks-hosted models.

This repo includes:

- Constraint checkers and registries (IFEval + IFBench OOD constraints)
- `ifeval_partial_credit_reward` scoring function
- An eval_protocol-native benchmark test (`@evaluation_test`)
- A hard-subset builder CLI to filter out easy rows
- A standalone evaluator CLI (optional utility)
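
To give a rough idea of what partial-credit scoring means, here is a minimal sketch. It assumes the score is the fraction of a prompt's constraints that the response satisfies; the actual `ifeval_partial_credit_reward` function may weight or aggregate differently. The `Checker` type, the `partial_credit` helper, and the two toy constraints are hypothetical and are not part of this package.

```python
from typing import Callable, Sequence

# Hypothetical checker type: takes the model response and returns True
# when the constraint is satisfied. Not the package's real interface.
Checker = Callable[[str], bool]


def partial_credit(response: str, checkers: Sequence[Checker]) -> float:
    """Return the fraction of constraints satisfied (0.0 if there are none)."""
    if not checkers:
        return 0.0
    passed = sum(1 for check in checkers if check(response))
    return passed / len(checkers)


# Two toy constraints, invented purely for illustration.
def at_most_five_words(text: str) -> bool:
    return len(text.split()) <= 5


def ends_with_period(text: str) -> bool:
    return text.rstrip().endswith(".")


# Three words (passes the first check), no trailing period (fails the second).
score = partial_credit("Hello there, world", [at_most_five_words, ends_with_period])
print(score)  # 0.5
```

Under this assumption, a response that meets one of two constraints scores 0.5 instead of failing outright, which is the difference between partial credit and strict all-or-nothing accuracy.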

Quickstart

uv sync --extra dev

Set your API key:

export FIREWORKS_API_KEY=...

Run with `ep local-test` (Primary)

Run the benchmark on the bundled 50-row sample:

uv run ep local-test --entry benchmarks/test_ifeval.py::test_ifeval_benchmark --ignore-docker

Use a different dataset file:

IFBENCH_DATA_PATH=data/ifbench_test_sample.jsonl \
uv run ep local-test --entry benchmarks/test_ifeval.py::test_ifeval_benchmark --ignore-docker

Override model params (including disabling reasoning):

EP_COMPLETION_PARAMS='[{"model":"fireworks_ai/accounts/fireworks/models/kimi-k2p5","temperature":1.0,"max_tokens":2048,"reasoning_effort":"none"}]' \
uv run ep local-test --entry benchmarks/test_ifeval.py::test_ifeval_benchmark --ignore-docker

reasoning_effort=false can also be passed in EP_COMPLETION_PARAMS and is normalized by Fireworks.
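
Writing the JSON value by hand is easy to get wrong (quoting, trailing commas), so one option is to generate it. The sketch below uses only the Python standard library and the same parameter values as the command above; it prints an assignment you can paste into a shell. It only builds the string and makes no assumption about how `ep` parses it.

```python
import json
import shlex

# Same parameters as the override example above: a JSON list with one entry.
params = [
    {
        "model": "fireworks_ai/accounts/fireworks/models/kimi-k2p5",
        "temperature": 1.0,
        "max_tokens": 2048,
        "reasoning_effort": "none",
    }
]

# Compact separators keep the value on one line with no extra spaces.
value = json.dumps(params, separators=(",", ":"))

# shlex.quote wraps the JSON in single quotes so the shell passes it through unchanged.
print(f"EP_COMPLETION_PARAMS={shlex.quote(value)}")
```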

Optional standalone CLI

If you do not want to use `ep local-test`, you can run the evaluation directly:

uv run ifbench-eval \
  --input-jsonl data/ifbench_test_sample.jsonl \
  --model accounts/fireworks/models/kimi-k2p5 \
  --route litellm \
  --temperature 1.0 \
  --max-tokens 2048

Disable reasoning/thinking:

uv run ifbench-eval \
  --input-jsonl data/ifbench_test_sample.jsonl \
  --model accounts/fireworks/models/kimi-k2p5 \
  --route litellm \
  --temperature 1.0 \
  --reasoning-effort none \
  --max-tokens 2048

--reasoning-effort false is accepted and normalized to none.
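
The normalization described above can be pictured roughly as follows. This is an illustrative sketch, not the CLI's actual code: the function name `normalize_reasoning_effort` is hypothetical, and treating the match on `false` as case-insensitive is an assumption.

```python
from typing import Optional, Union


def normalize_reasoning_effort(value: Union[str, bool, None]) -> Optional[str]:
    """Map a 'false'-like value to "none"; pass other values through unchanged."""
    if value is None:
        return None  # flag not provided: leave the setting unset
    if value is False:
        return "none"
    text = str(value).strip()
    # Assumption: the comparison is case-insensitive ("False", "FALSE", ...).
    return "none" if text.lower() == "false" else text


print(normalize_reasoning_effort("false"))  # none
print(normalize_reasoning_effort("none"))   # none
```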

Build IFEval-Hard

Filter out rows where a baseline model is consistently strong:

uv run ifbench-build-hard \
  --hf-dataset allenai/IFBench_test \
  --hf-config default \
  --hf-split train \
  --baseline-model accounts/fireworks/models/kimi-k2p5 \
  --route litellm \
  --temperature 0 \
  --reasoning-effort none \
  --runs 2 \
  --easy-threshold 0.95 \
  --output data/ifbench_test_hard.jsonl
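
Conceptually, the filtering step can be sketched as below. It assumes a row counts as easy when the baseline's mean score across runs reaches the threshold; the real `ifbench-build-hard` command may aggregate runs differently (for example, requiring every run to pass). The `select_hard_rows` helper and the row ids are hypothetical.

```python
from statistics import mean
from typing import Dict, List


def select_hard_rows(
    scores_by_row: Dict[str, List[float]],
    easy_threshold: float = 0.95,
) -> List[str]:
    """Keep the ids of rows whose mean baseline score is below the threshold.

    Rows without any scores are kept, since they were never shown to be easy.
    """
    return [
        row_id
        for row_id, scores in scores_by_row.items()
        if not scores or mean(scores) < easy_threshold
    ]


# Made-up per-row scores from two baseline runs (mirrors --runs 2).
baseline_scores = {
    "row-a": [1.0, 1.0],  # mean 1.0  -> easy, dropped
    "row-b": [1.0, 0.5],  # mean 0.75 -> kept
    "row-c": [0.0, 0.0],  # mean 0.0  -> kept
}
print(select_hard_rows(baseline_scores))  # ['row-b', 'row-c']
```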

Then evaluate on the hard subset:

uv run ifbench-eval \
  --input-jsonl data/ifbench_test_hard.jsonl \
  --model accounts/fireworks/models/kimi-k2p5 \
  --route litellm \
  --temperature 0 \
  --reasoning-effort none \
  --max-tokens 2048

Tests

uv run pytest -q

Attribution

Constraint source files in ifbench/ are adapted from:

- open-instruct/open_instruct/IFEvalG/ (IFEval constraints)
- IFBench/ (IFBench OOD constraints)

Original license headers were preserved in copied files.
