Standalone native G-Eval benchmark package for Eval Protocol.
This repo includes:
- native G-Eval reward implementation (`geval/reward.py`)
- SummEval-aligned benchmark entry (`benchmarks/test_geval.py`)
- SummEval sample dataset (`data/summeval_sample_100.jsonl`)
- dataset preparation script (`geval/prepare_data.py`)
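As a rough illustration of working with the sample dataset, a JSONL file like `data/summeval_sample_100.jsonl` can be loaded one JSON object per line. This loader is a generic sketch, not the package's actual data-loading code, and it makes no assumption about the row schema:

```python
import json
from pathlib import Path


def load_jsonl(path):
    """Load one JSON object per line from a JSONL file; blank lines are skipped."""
    rows = []
    for line in Path(path).read_text().splitlines():
        if line.strip():
            rows.append(json.loads(line))
    return rows
```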
Install dependencies:

```bash
uv sync --extra dev
```

Set one key:

```bash
export OPENAI_API_KEY=...
# or Fireworks (OpenAI-compatible)
export FIREWORKS_API_KEY=...
```

Run the benchmark:

```bash
uv run ep local-test --entry benchmarks/test_geval.py::test_geval_benchmark --ignore-docker
```

or

```bash
uv run pytest benchmarks/test_geval.py::test_geval_benchmark -q -s
```

Configuration (environment variables):

- `GEVAL_DATA_PATH`: path to JSONL dataset (default: `data/summeval_sample_100.jsonl`)
- `GEVAL_DIMENSION`: `coherence|consistency|relevance|fluency` (default: `consistency`)
- `GEVAL_JUDGE_MODEL`: judge model id (default: `gpt-4.1-mini`)
- `GEVAL_TOP_LOGPROBS`: top logprobs count (default: `5`)
- `GEVAL_SAMPLING_FALLBACK_N`: sampling fallback count when logprobs are unsupported (default: `20`)
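For reference, the environment variables above with their documented defaults could be read like this. This is an illustrative sketch, not the package's actual configuration code; the real implementation may parse them differently:

```python
import os


def geval_config():
    """Read G-Eval settings from the environment, applying the documented defaults."""
    return {
        "data_path": os.getenv("GEVAL_DATA_PATH", "data/summeval_sample_100.jsonl"),
        "dimension": os.getenv("GEVAL_DIMENSION", "consistency"),
        "judge_model": os.getenv("GEVAL_JUDGE_MODEL", "gpt-4.1-mini"),
        "top_logprobs": int(os.getenv("GEVAL_TOP_LOGPROBS", "5")),
        "sampling_fallback_n": int(os.getenv("GEVAL_SAMPLING_FALLBACK_N", "20")),
    }
```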
Rebuild the dataset sample:

```bash
uv run geval-build-data --limit 100 --output data/summeval_sample_100.jsonl
```

Notes:

- The reward computes a discrete score and, when possible, a probability-weighted expected score.
- Probability weighting prefers token logprobs and falls back to sampling-based estimation.
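The probability-weighted expected score can be sketched as follows. Assuming the judge emits a discrete score on a 1-5 scale, the expected score is the probability-weighted mean over the score tokens in the judge's top logprobs; when logprobs are unavailable, the fallback is to average scores across repeated samples. Function names here are illustrative, not the names in `geval/reward.py`:

```python
import math


def expected_score_from_logprobs(top_logprobs, min_score=1, max_score=5):
    """Compute E[score] from the judge's top-token logprobs.

    top_logprobs: mapping of token text -> logprob for the score position.
    Only tokens that parse to an in-range integer are counted; their
    probabilities are renormalized before taking the weighted mean.
    """
    weights = {}
    for token, lp in top_logprobs.items():
        tok = token.strip()
        if tok.isdigit() and min_score <= int(tok) <= max_score:
            weights[int(tok)] = weights.get(int(tok), 0.0) + math.exp(lp)
    total = sum(weights.values())
    if total == 0:
        return None  # no usable score tokens; caller falls back to sampling
    return sum(s * w for s, w in weights.items()) / total


def expected_score_from_samples(sampled_scores):
    """Sampling fallback: average N discrete judge scores."""
    return sum(sampled_scores) / len(sampled_scores)
```

The logprob route needs only one judge call, which is why it is preferred; the sampling fallback trades extra calls for an estimate on providers that do not expose token logprobs.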