RULER evaluation of gpt-oss-120b model #939

cizekmilan · 2026-04-27T09:38:38Z

Hi,

I’m experimenting with RULER long-context evaluation using nemo-evaluator and a locally hosted openai/gpt-oss-120b model (served via vLLM with an OpenAI-compatible API).

However, I’m observing behavior that seems counterintuitive, and I’d like to ask whether I’m misconfiguring something or misunderstanding the benchmark.

Setup

- Model: openai/gpt-oss-120b (served locally via vLLM)
- Endpoint: OpenAI-compatible (/v1/chat/completions)
- Evaluator: nemo-evaluator
- Task: niah_single_1 (needle-in-a-haystack)
- Seed: fixed random seed
- Samples: tested both small (50) and larger (500) counts

Command:

MODEL_NAME="openai/gpt-oss-120b"

for ctx in 4k 8k 16k 32k 64k 128k; do
  echo "=== Running $ctx ==="

  nemo-evaluator run_eval \
    --eval_type ruler-${ctx}-completions \
    --model_id "$MODEL_NAME" \
    --model_type completions \
    --model_url http://my_server:8000/v1/chat/completions \
    --api_key_name OPENAI_API_KEY \
    --output_dir /workspace/results/${ctx} \
    --overrides "config.params.extra.tokenizer=$MODEL_NAME,config.params.extra.subtasks=niah_single_1"

  echo "=== DONE $ctx ==="
done

Observed Results

Accuracy tends to increase with context length, though not monotonically, which seems counterintuitive:

task            4k     8k     16k    32k    64k    128k
niah_single_1   0.78   0.64   0.82   0.72   0.96   1.00

This trend is consistent even with higher sample counts (e.g., 500).
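
To judge how much of the wobble between neighbouring context lengths could be sampling noise, a rough interval can be put around each accuracy. The sketch below uses the normal approximation to the binomial; `binom_ci` is a hypothetical helper name, and the sample count of 50 is an assumption about which run produced the table. The approximation degenerates at an accuracy of exactly 1.00, so that cell should not be read as "no uncertainty".

```shell
# binom_ci ACC N: print a rough 95% interval "lo hi" for an accuracy ACC
# measured on N samples (normal approximation, clipped to [0, 1]).
binom_ci() {
  awk -v p="$1" -v n="$2" 'BEGIN {
    se = sqrt(p * (1 - p) / n)              # standard error of a proportion
    lo = p - 1.96 * se; if (lo < 0) lo = 0
    hi = p + 1.96 * se; if (hi > 1) hi = 1
    printf "%.3f %.3f\n", lo, hi
  }'
}

# Accuracies from the table above; n=50 is assumed, change it to 500 for the larger run.
for acc in 0.78 0.64 0.82 0.72 0.96 1.00; do
  printf '%s @ n=50: ' "$acc"
  binom_ci "$acc" 50
done
```

Under the n=50 assumption the 4k and 8k intervals overlap, so the dip at 8k alone would not be strong evidence of a real difference; the comparison is more informative with the 500-sample numbers.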

Questions

1. Is this expected behavior for RULER / NIAH tasks?
   - I would expect accuracy to decrease or at least plateau with longer context.
2. Does RULER:
   - generate different datasets per context length?
   - vary the needle position (e.g., closer to the end in longer contexts)?
3. Could this be caused by misconfiguration on my side?
   - Using --model_type completions with a /chat/completions endpoint?
   - Missing overrides (e.g., tokenizer, max_seq_length, etc.)?
4. Is there a recommended way to:
   - control needle position
   - or ensure comparable samples across context sizes?
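
The endpoint question can at least be turned into a mechanical pre-flight check. The sketch below only compares strings: it classifies the URL by its path suffix and checks that it agrees with the value passed to --model_type. The function names are hypothetical, and it assumes that `chat` is the model type that pairs with /v1/chat/completions; it says nothing about how nemo-evaluator itself behaves when the two disagree.

```shell
# endpoint_kind URL: classify an OpenAI-compatible URL by its path suffix.
endpoint_kind() {
  case "${1%/}" in                        # ignore one trailing slash
    */v1/chat/completions) echo chat ;;
    */v1/completions)      echo completions ;;
    *)                     echo unknown ;;
  esac
}

# check_model_type MODEL_TYPE URL: succeed only if the two agree.
check_model_type() {
  kind=$(endpoint_kind "$2")
  if [ "$kind" = "$1" ]; then
    echo "ok: model_type=$1 matches $kind endpoint"
  else
    echo "mismatch: model_type=$1 but URL is a $kind endpoint" >&2
    return 1
  fi
}

# The combination used in the command above is reported as a mismatch.
check_model_type completions "http://my_server:8000/v1/chat/completions" || true
```

vLLM's OpenAI-compatible server exposes both /v1/completions and /v1/chat/completions, so rerunning one context length with the URL pointed at /v1/completions would show whether the pairing affects the scores.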

Hypothesis

My suspicion is that:

- needle placement may not be controlled across context lengths
- longer contexts might bias the needle toward later positions
- which could make retrieval easier (recency bias)

But I’m not sure if this is expected or indicates a setup issue.
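
The placement hypothesis could be tested directly on the generated prompts by measuring how deep the needle sits in each one. The sketch below is a hypothetical helper, `needle_depth`, that reads one prompt on stdin and prints the fraction of characters preceding the needle. Where the evaluator stores the generated samples and how to extract a single prompt from them are not covered here, and character offsets are only a proxy for token offsets.

```shell
# needle_depth NEEDLE: read a prompt on stdin and print the fraction (0..1)
# of its characters that precede the first occurrence of NEEDLE.
needle_depth() {
  NEEDLE="$1" awk '
    { text = text $0 "\n" }                  # slurp the whole prompt
    END {
      pos = index(text, ENVIRON["NEEDLE"])   # 1-based offset, 0 if absent
      if (pos == 0) { print "not found"; exit 1 }
      printf "%.2f\n", (pos - 1) / length(text)
    }'
}

# Synthetic example: the needle "KEY" starts halfway through a 20-character prompt.
printf 'aaaaaaaaaaKEYbbbbbb\n' | needle_depth KEY   # prints 0.50
```

Comparing the distribution of these depths across the 4k to 128k runs would show whether longer contexts really place the needle later.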

Any guidance or clarification would be greatly appreciated 🙏

sammysosa31534-cpu · 2026-05-13T00:04:36Z

#1007
