G-Eval Standalone

Standalone native G-Eval benchmark package for Eval Protocol.

This repo includes:

  • native G-Eval reward implementation (geval/reward.py)
  • SummEval-aligned benchmark entry (benchmarks/test_geval.py)
  • SummEval sample dataset (data/summeval_sample_100.jsonl)
  • dataset preparation script (geval/prepare_data.py)

Quickstart

uv sync --extra dev

Set one API key:

export OPENAI_API_KEY=...
# or Fireworks (OpenAI-compatible)
export FIREWORKS_API_KEY=...

Run benchmark

uv run ep local-test --entry benchmarks/test_geval.py::test_geval_benchmark --ignore-docker

or

uv run pytest benchmarks/test_geval.py::test_geval_benchmark -q -s

Key env vars

  • GEVAL_DATA_PATH: path to JSONL dataset (default: data/summeval_sample_100.jsonl)
  • GEVAL_DIMENSION: coherence|consistency|relevance|fluency (default: consistency)
  • GEVAL_JUDGE_MODEL: judge model id (default: gpt-4.1-mini)
  • GEVAL_TOP_LOGPROBS: top logprobs count (default: 5)
  • GEVAL_SAMPLING_FALLBACK_N: sampling fallback count when logprobs unsupported (default: 20)
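The variables above could be wired together roughly like this. This is a sketch only: the `GevalConfig` dataclass and `load_config` helper are illustrative names, not the package's actual API; the variable names and defaults are taken from the list above.

```python
import os
from dataclasses import dataclass

@dataclass
class GevalConfig:
    data_path: str
    dimension: str
    judge_model: str
    top_logprobs: int
    sampling_fallback_n: int

def load_config() -> GevalConfig:
    # Validate the dimension against the four documented options.
    dimension = os.environ.get("GEVAL_DIMENSION", "consistency")
    allowed = {"coherence", "consistency", "relevance", "fluency"}
    if dimension not in allowed:
        raise ValueError(f"GEVAL_DIMENSION must be one of {sorted(allowed)}")
    return GevalConfig(
        data_path=os.environ.get("GEVAL_DATA_PATH", "data/summeval_sample_100.jsonl"),
        dimension=dimension,
        judge_model=os.environ.get("GEVAL_JUDGE_MODEL", "gpt-4.1-mini"),
        top_logprobs=int(os.environ.get("GEVAL_TOP_LOGPROBS", "5")),
        sampling_fallback_n=int(os.environ.get("GEVAL_SAMPLING_FALLBACK_N", "20")),
    )
```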

Refresh dataset

uv run geval-build-data --limit 100 --output data/summeval_sample_100.jsonl
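To spot-check a regenerated file, a small helper like the following works for any JSON Lines dataset. The `head` function is illustrative (not part of this package); it makes no assumption about the record schema, so inspect the returned keys to see the actual fields.

```python
import json

def head(path: str, n: int = 3) -> list:
    """Return the first n records of a JSONL file as parsed dicts."""
    records = []
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f):
            if i >= n:
                break
            records.append(json.loads(line))
    return records
```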

Notes

  • The reward computes a discrete score and, when possible, a probability-weighted expected score.
  • Probability weighting prefers token logprobs and falls back to sampling-based estimation.
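The two strategies in the notes above can be sketched as follows. Function names and the 1–5 score range are illustrative assumptions, not the reward module's actual API: the logprob path keeps the judge's top score-token candidates that parse as valid scores, renormalizes their probability mass, and takes the expectation; the fallback path simply averages N sampled discrete scores.

```python
import math

def expected_score(top_logprobs: dict, lo: int = 1, hi: int = 5) -> float:
    """Probability-weighted expected score from the judge's top logprobs."""
    weights = {}
    for token, logprob in top_logprobs.items():
        token = token.strip()
        if token.isdigit() and lo <= int(token) <= hi:
            # Accumulate probability mass per distinct valid score value.
            weights[int(token)] = weights.get(int(token), 0.0) + math.exp(logprob)
    total = sum(weights.values())
    if total == 0:
        raise ValueError("no valid score tokens among top logprobs")
    # Renormalize over valid scores, then take the expectation.
    return sum(score * p for score, p in weights.items()) / total

def sampled_expected_score(samples: list) -> float:
    """Fallback when logprobs are unsupported: mean of N sampled scores."""
    return sum(samples) / len(samples)
```

For example, a judge that puts 70% mass on "4" and 30% on "5" yields an expected score of 4.3 rather than a flat discrete 4.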
