Standalone IFBench/IFEval benchmark package for Fireworks-hosted models.
This repo includes:
- Constraint checkers and registries (IFEval + IFBench OOD constraints)
- ifeval_partial_credit_reward scoring function
- An eval_protocol-native benchmark test (@evaluation_test)
- A hard-subset builder CLI to filter out easy rows
- A standalone evaluator CLI (optional utility)
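
As a rough illustration of what partial-credit scoring over constraint checkers can look like, the sketch below scores a response by the fraction of constraints it satisfies. This is an assumption about the general idea, not the actual implementation of ifeval_partial_credit_reward; the `Checker` type, `partial_credit`, and both toy constraints are hypothetical names made up for this example.

```python
from typing import Callable, Sequence

# Hypothetical checker type: takes the model response and returns True
# when the constraint is satisfied. Not the package's real interface.
Checker = Callable[[str], bool]


def partial_credit(response: str, checkers: Sequence[Checker]) -> float:
    """Return the fraction of constraints satisfied (0.0 if there are none)."""
    if not checkers:
        return 0.0
    passed = sum(1 for check in checkers if check(response))
    return passed / len(checkers)


# Two toy constraints, for illustration only.
def at_most_five_words(text: str) -> bool:
    return len(text.split()) <= 5


def ends_with_period(text: str) -> bool:
    return text.rstrip().endswith(".")


score = partial_credit("Short answer here", [at_most_five_words, ends_with_period])
print(score)  # 0.5: the length constraint passes, the punctuation one fails
```

A strict (all-or-nothing) score would instead return 1.0 only when every checker passes; partial credit gives a smoother signal on prompts with several constraints.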
Install dependencies:

    uv sync --extra dev

Set your API key:

    export FIREWORKS_API_KEY=...

Run the benchmark on the bundled 50-row sample:

    uv run ep local-test --entry benchmarks/test_ifeval.py::test_ifeval_benchmark --ignore-docker

Use a different dataset file:

    IFBENCH_DATA_PATH=data/ifbench_test_sample.jsonl \
    uv run ep local-test --entry benchmarks/test_ifeval.py::test_ifeval_benchmark --ignore-docker

Override model params (including disabling reasoning):

    EP_COMPLETION_PARAMS='[{"model":"fireworks_ai/accounts/fireworks/models/kimi-k2p5","temperature":1.0,"max_tokens":2048,"reasoning_effort":"none"}]' \
    uv run ep local-test --entry benchmarks/test_ifeval.py::test_ifeval_benchmark --ignore-docker

reasoning_effort=false can also be passed in EP_COMPLETION_PARAMS and is normalized by Fireworks.
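
Hand-writing the JSON for EP_COMPLETION_PARAMS is easy to get wrong (unbalanced quotes, trailing commas). One option is to generate the value with a short script; the sketch below uses only the Python standard library and reuses the parameter values from the command above. The helper name `export_line` is hypothetical and not part of this package.

```python
import json
import shlex

# Same values as the shell example above: a JSON list with one params object.
params = [
    {
        "model": "fireworks_ai/accounts/fireworks/models/kimi-k2p5",
        "temperature": 1.0,
        "max_tokens": 2048,
        "reasoning_effort": "none",
    }
]


def export_line(value: list) -> str:
    """Build a shell-safe `export` line for EP_COMPLETION_PARAMS (hypothetical helper)."""
    encoded = json.dumps(value, separators=(",", ":"))
    # shlex.quote wraps the JSON in single quotes so the shell passes it through intact.
    return f"export EP_COMPLETION_PARAMS={shlex.quote(encoded)}"


print(export_line(params))
```

The printed line can be pasted into a shell or evaluated before running `uv run ep local-test`.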
If you do not want to use ep local-test, you can still run direct evaluation:

    uv run ifbench-eval \
      --input-jsonl data/ifbench_test_sample.jsonl \
      --model accounts/fireworks/models/kimi-k2p5 \
      --route litellm \
      --temperature 1.0 \
      --max-tokens 2048

Disable reasoning/thinking:

    uv run ifbench-eval \
      --input-jsonl data/ifbench_test_sample.jsonl \
      --model accounts/fireworks/models/kimi-k2p5 \
      --route litellm \
      --temperature 1.0 \
      --reasoning-effort none \
      --max-tokens 2048

--reasoning-effort false is accepted and normalized to none.
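
The normalization mentioned above can be pictured as a small mapping step applied before the request is built. The function below is a minimal sketch under that assumption; `normalize_reasoning_effort` is a hypothetical name, and the real CLI may handle more cases or do this differently.

```python
from typing import Optional, Union


def normalize_reasoning_effort(value: Union[str, bool, None]) -> Optional[str]:
    """Map the boolean-style value false to "none"; pass other values through lowercased.

    Hypothetical helper for illustration, not the package's actual code.
    """
    if value is None:
        return None  # option not set: leave the provider default in place
    if value is False:
        return "none"
    text = str(value).strip().lower()
    if text == "false":
        return "none"
    return text


print(normalize_reasoning_effort("false"))  # none
print(normalize_reasoning_effort("none"))   # none
print(normalize_reasoning_effort("low"))    # low
```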
Filter out rows where a baseline model is consistently strong:

    uv run ifbench-build-hard \
      --hf-dataset allenai/IFBench_test \
      --hf-config default \
      --hf-split train \
      --baseline-model accounts/fireworks/models/kimi-k2p5 \
      --route litellm \
      --temperature 0 \
      --reasoning-effort none \
      --runs 2 \
      --easy-threshold 0.95 \
      --output data/ifbench_test_hard.jsonl

Then evaluate on the hard subset:

    uv run ifbench-eval \
      --input-jsonl data/ifbench_test_hard.jsonl \
      --model accounts/fireworks/models/kimi-k2p5 \
      --route litellm \
      --temperature 0 \
      --reasoning-effort none \
      --max-tokens 2048

Run the tests:

    uv run pytest -q

Constraint source files in ifbench/ are adapted from:
- open-instruct/open_instruct/IFEvalG/ (IFEval constraints)
- IFBench/ (IFBench OOD constraints)
Original license headers were preserved in copied files.
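
For readers curious how the hard-subset filter described earlier might work, here is a minimal sketch. It assumes a row counts as "easy" when the baseline model's mean score across all runs reaches the --easy-threshold value; the actual rule used by ifbench-build-hard may differ. The function name `select_hard_rows` and the row IDs are hypothetical.

```python
from statistics import mean
from typing import Dict, List


def select_hard_rows(
    scores_by_row: Dict[str, List[float]],
    easy_threshold: float = 0.95,
) -> List[str]:
    """Keep rows whose mean baseline score is below the threshold.

    Assumption: "easy" means mean score across runs >= easy_threshold.
    Rows with no recorded scores are kept, since nothing shows they are easy.
    """
    hard = []
    for row_id, scores in scores_by_row.items():
        if not scores or mean(scores) < easy_threshold:
            hard.append(row_id)
    return hard


# Two baseline runs per row, matching --runs 2 in the command earlier.
baseline_scores = {
    "row-1": [1.0, 1.0],  # consistently solved: filtered out
    "row-2": [1.0, 0.5],  # mean 0.75: kept
    "row-3": [0.0, 0.0],  # never solved: kept
}
print(select_hard_rows(baseline_scores))  # ['row-2', 'row-3']
```

Using more than one run guards against dropping a row the baseline solved only by chance, which matters most when the temperature is above 0.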