RULER evaluation of gpt-oss-120b model #939
Unanswered
cizekmilan
asked this question in
Q&A
Replies: 1 comment
-
Beta Was this translation helpful? Give feedback.
0 replies
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Uh oh!
There was an error while loading. Please reload this page.
Uh oh!
There was an error while loading. Please reload this page.
-
Hi,
I’m experimenting with RULER long-context evaluation using
nemo-evaluatorand a locally hosted openai/gpt-oss-120b model (served via vLLM with OpenAI-compatible API).However, I’m observing a behavior that seems counterintuitive, and I’d like to ask whether I’m misconfiguring something or misunderstanding the benchmark.
Setup
openai/gpt-oss-120b(served locally via vLLM)/v1/chat/completions)nemo-evaluatorniah_single_1(needle-in-a-haystack)Command:
Observed Results
Accuracy increases with context length, which seems unintuitive:
This trend is consistent even with higher sample counts (e.g., 500).
Questions
Is this expected behavior for RULER / NIAH tasks?
Does RULER:
Could this be caused by misconfiguration on my side?
--model_type completionswith a/chat/completionsendpoint?Is there a recommended way to:
Hypothesis
My suspicion is that:
But I’m not sure if this is expected or indicates a setup issue.
Any guidance or clarification would be greatly appreciated 🙏
Beta Was this translation helpful? Give feedback.
All reactions