eval-protocol/ifbench

IFBench Standalone

Standalone IFBench/IFEval benchmark package for Fireworks-hosted models.

This repo includes:

  • Constraint checkers and registries (IFEval + IFBench OOD constraints)
  • ifeval_partial_credit_reward scoring function
  • An eval_protocol-native benchmark test (@evaluation_test)
  • A hard-subset builder CLI to filter out easy rows
  • A standalone evaluator CLI (optional utility)
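The idea behind a partial-credit score can be sketched as follows. This is a minimal illustration, assuming the reward is the fraction of a prompt's constraints that the response satisfies; the actual ifeval_partial_credit_reward signature and aggregation are not shown in this README, so `partial_credit`, `Checker`, and the two example checkers are hypothetical names.

```python
from typing import Callable, Sequence

# Hypothetical checker type: takes the model response and returns True
# when the constraint holds. The repo's real checker interface may differ.
Checker = Callable[[str], bool]


def partial_credit(response: str, checkers: Sequence[Checker]) -> float:
    """Return the fraction of constraints satisfied (0.0 when none are given)."""
    if not checkers:
        return 0.0
    passed = sum(1 for check in checkers if check(response))
    return passed / len(checkers)


# Two illustrative constraints, not taken from the repo's registries.
example_checkers = [
    lambda text: len(text.split()) >= 3,  # at least three words
    lambda text: text.endswith("."),      # ends with a period
]

# The response has three words but no trailing period: one of two checks passes.
score = partial_credit("Hello there world", example_checkers)
```

Under this assumption, a response that meets some but not all constraints earns a score strictly between 0 and 1 instead of an all-or-nothing result.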

Quickstart

Install dependencies, including the dev extra:

uv sync --extra dev

Set your API key:

export FIREWORKS_API_KEY=...

Run with ep local-test (Primary)

Run the benchmark on the bundled 50-row sample:

uv run ep local-test --entry benchmarks/test_ifeval.py::test_ifeval_benchmark --ignore-docker

Use a different dataset file:

IFBENCH_DATA_PATH=data/ifbench_test_sample.jsonl \
uv run ep local-test --entry benchmarks/test_ifeval.py::test_ifeval_benchmark --ignore-docker

Override model params (including disabling reasoning):

EP_COMPLETION_PARAMS='[{"model":"fireworks_ai/accounts/fireworks/models/kimi-k2p5","temperature":1.0,"max_tokens":2048,"reasoning_effort":"none"}]' \
uv run ep local-test --entry benchmarks/test_ifeval.py::test_ifeval_benchmark --ignore-docker

reasoning_effort=false can also be passed in EP_COMPLETION_PARAMS and is normalized by Fireworks.
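Hand-quoting the JSON for EP_COMPLETION_PARAMS in a shell is easy to get wrong. A small sketch of building the same value from Python with `json.dumps` is shown here; the parameter values are copied from the command above, while the variable names and the commented-out launch step are illustrative assumptions.

```python
import json
import os

# Same parameters as the shell example: a JSON list with one dict per configuration.
completion_params = [
    {
        "model": "fireworks_ai/accounts/fireworks/models/kimi-k2p5",
        "temperature": 1.0,
        "max_tokens": 2048,
        "reasoning_effort": "none",
    }
]

# Copy the current environment and add the serialized override.
env = dict(os.environ, EP_COMPLETION_PARAMS=json.dumps(completion_params))

# The command from the README, split into arguments.
command = [
    "uv", "run", "ep", "local-test",
    "--entry", "benchmarks/test_ifeval.py::test_ifeval_benchmark",
    "--ignore-docker",
]

# To launch it (left commented out so the sketch has no side effects):
# import subprocess
# subprocess.run(command, env=env, check=True)
```

Serializing with `json.dumps` guarantees valid JSON, including correct quoting of strings and lowercase `false` if a boolean is passed.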

Optional standalone CLI

If you do not want to use ep local-test, you can still run direct evaluation:

uv run ifbench-eval \
  --input-jsonl data/ifbench_test_sample.jsonl \
  --model accounts/fireworks/models/kimi-k2p5 \
  --route litellm \
  --temperature 1.0 \
  --max-tokens 2048

Disable reasoning/thinking:

uv run ifbench-eval \
  --input-jsonl data/ifbench_test_sample.jsonl \
  --model accounts/fireworks/models/kimi-k2p5 \
  --route litellm \
  --temperature 1.0 \
  --reasoning-effort none \
  --max-tokens 2048

--reasoning-effort false is accepted and normalized to none.
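The normalization described above might look roughly like this. It is a sketch only: `normalize_reasoning_effort` is a hypothetical helper, and how the CLI treats values other than `false` and `none` is an assumption here (they are lowercased and passed through).

```python
from typing import Optional, Union


def normalize_reasoning_effort(value: Union[str, bool, None]) -> Optional[str]:
    """Map false-like inputs to "none"; pass other values through lowercased."""
    if value is None:
        return None  # flag not supplied: leave the setting unset
    if value is False:
        return "none"
    text = str(value).strip().lower()
    if text in {"false", "none"}:
        return "none"
    return text


# Both spellings from the README resolve to the same setting.
from_false = normalize_reasoning_effort("false")
from_none = normalize_reasoning_effort("none")
```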

Build IFEval-Hard

Filter out easy rows, that is, rows on which a baseline model scores consistently high across repeated runs:

uv run ifbench-build-hard \
  --hf-dataset allenai/IFBench_test \
  --hf-config default \
  --hf-split train \
  --baseline-model accounts/fireworks/models/kimi-k2p5 \
  --route litellm \
  --temperature 0 \
  --reasoning-effort none \
  --runs 2 \
  --easy-threshold 0.95 \
  --output data/ifbench_test_hard.jsonl
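The filtering step can be pictured with the sketch below. The exact rule used by ifbench-build-hard is not documented here, so this assumes a row counts as easy when its lowest score across all runs is at or above the threshold; averaging the runs would be another plausible reading. `select_hard_rows` and the row IDs are hypothetical.

```python
from typing import Dict, List


def select_hard_rows(
    scores_by_row: Dict[str, List[float]],
    easy_threshold: float = 0.95,
) -> List[str]:
    """Keep rows whose worst baseline score falls below the easy threshold."""
    hard_rows = []
    for row_id, scores in scores_by_row.items():
        # Assumption: "consistently strong" means every run clears the threshold.
        if min(scores) < easy_threshold:
            hard_rows.append(row_id)
    return hard_rows


# Made-up baseline scores from two runs per row, for illustration only.
baseline_scores = {
    "row-a": [1.0, 1.0],   # strong in both runs: dropped as easy
    "row-b": [1.0, 0.5],   # inconsistent: kept
    "row-c": [0.0, 0.25],  # weak in both runs: kept
}
hard = select_hard_rows(baseline_scores, easy_threshold=0.95)
```

With `--runs 2` and `--easy-threshold 0.95`, this reading keeps any row where at least one of the two baseline runs scored below 0.95.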

Then evaluate on the hard subset:

uv run ifbench-eval \
  --input-jsonl data/ifbench_test_hard.jsonl \
  --model accounts/fireworks/models/kimi-k2p5 \
  --route litellm \
  --temperature 0 \
  --reasoning-effort none \
  --max-tokens 2048

Tests

uv run pytest -q

Attribution

Constraint source files in ifbench/ are adapted from:

  • open-instruct/open_instruct/IFEvalG/ (IFEval constraints)
  • IFBench/ (IFBench OOD constraints)

Original license headers were preserved in copied files.

About

Add IFBench, source: https://github.com/allenai/IFBench
