Add metrics example and update requirements #6

Open
taidnguyen wants to merge 1 commit into allenai:main from taidnguyen:main

@taidnguyen (Contributor) commented Mar 29, 2026

  • Add an example, examples/compute_metrics.py, for calculating metrics, with validation that the HF checkpoints produce the same logits as the released results.
  • Fix the ai2-olmo version in requirements.txt.
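
For reviewers, the validation in the first bullet amounts to comparing each per-choice value computed from the HF checkpoint against the released reference. A minimal sketch of that comparison follows. The record layout, the `find_mismatches` helper, and the 0.01 tolerance are all assumptions for illustration (the field names are borrowed from the labels the script prints), so the code in examples/compute_metrics.py may be organized differently.

```python
import math

# Hypothetical record layout: one dict per answer choice.
# Field names mirror the labels printed by the example script.
FIELDS = ("sum_logits", "num_tokens", "num_chars")


def find_mismatches(predicted, reference, abs_tol=0.01):
    """Return (choice_index, field, pred, ref) for every value that differs by more than abs_tol."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same number of choices")
    mismatches = []
    for i, (pred, ref) in enumerate(zip(predicted, reference)):
        for field in FIELDS:
            # Absolute tolerance only; the 0.01 default is an assumed threshold.
            if not math.isclose(pred[field], ref[field], rel_tol=0.0, abs_tol=abs_tol):
                mismatches.append((i, field, pred[field], ref[field]))
    return mismatches


pred = [{"sum_logits": -33.75, "num_tokens": 12, "num_chars": 60}]
ref = [{"sum_logits": -33.75, "num_tokens": 12, "num_chars": 60}]
print(find_mismatches(pred, ref))  # prints []
```

An empty list means every per-choice value agrees within the tolerance. Returning the mismatches rather than asserting keeps the check usable both in a test and in a printed report.
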
tainguyen7597@Mac datadecide-fork % python examples/compute_metrics.py  

Doc ID: Mercury_417466
correct_choice: predicted=0, reference=0
acc_raw: predicted=0, reference=0
acc_per_token: predicted=1, reference=1
acc_per_char: predicted=1, reference=1
- sum_logits: pred=-33.75, ref=-33.75
- num_tokens: pred=12.00, ref=12.00
- num_chars: pred=60.00, ref=60.00
- sum_logits: pred=-36.50, ref=-36.50
- num_tokens: pred=11.00, ref=11.00
- num_chars: pred=55.00, ref=55.00
- sum_logits: pred=-29.18, ref=-29.18
- num_tokens: pred=9.00, ref=9.00
- num_chars: pred=51.00, ref=51.00
- sum_logits: pred=-38.65, ref=-38.65
- num_tokens: pred=8.00, ref=8.00
- num_chars: pred=44.00, ref=44.00
----------------------------------------

Doc ID: Mercury_7081673
correct_choice: predicted=1, reference=1
acc_raw: predicted=1, reference=1
acc_per_token: predicted=0, reference=0
acc_per_char: predicted=1, reference=1
- sum_logits: pred=-12.96, ref=-12.96
- num_tokens: pred=4.00, ref=4.00
- num_chars: pred=15.00, ref=15.00
- sum_logits: pred=-10.86, ref=-10.86
- num_tokens: pred=2.00, ref=2.00
- num_chars: pred=15.00, ref=15.00
- sum_logits: pred=-11.42, ref=-11.42
- num_tokens: pred=2.00, ref=2.00
- num_chars: pred=14.00, ref=14.00
- sum_logits: pred=-19.42, ref=-19.42
- num_tokens: pred=3.00, ref=3.00
- num_chars: pred=11.00, ref=11.00
----------------------------------------

Doc ID: Mercury_7239733
correct_choice: predicted=3, reference=3
acc_raw: predicted=0, reference=0
acc_per_token: predicted=0, reference=0
acc_per_char: predicted=0, reference=0
- sum_logits: pred=-14.24, ref=-14.24
- num_tokens: pred=2.00, ref=2.00
- num_chars: pred=12.00, ref=12.00
- sum_logits: pred=-14.39, ref=-14.39
- num_tokens: pred=2.00, ref=2.00
- num_chars: pred=11.00, ref=11.00
- sum_logits: pred=-12.40, ref=-12.40
- num_tokens: pred=2.00, ref=2.00
- num_chars: pred=13.00, ref=13.00
- sum_logits: pred=-14.41, ref=-14.41
- num_tokens: pred=2.00, ref=2.00
- num_chars: pred=12.00, ref=12.00
----------------------------------------

Doc ID: NYSEDREGENTS_2015_4_8
correct_choice: predicted=3, reference=3
acc_raw: predicted=1, reference=1
acc_per_token: predicted=1, reference=1
acc_per_char: predicted=1, reference=1
- sum_logits: pred=-10.33, ref=-10.33
- num_tokens: pred=1.00, ref=1.00
- num_chars: pred=5.00, ref=5.00
- sum_logits: pred=-8.12, ref=-8.12
- num_tokens: pred=1.00, ref=1.00
- num_chars: pred=5.00, ref=5.00
- sum_logits: pred=-6.02, ref=-6.02
- num_tokens: pred=1.00, ref=1.00
- num_chars: pred=5.00, ref=5.00
- sum_logits: pred=-5.65, ref=-5.65
- num_tokens: pred=1.00, ref=1.00
- num_chars: pred=5.00, ref=5.00
----------------------------------------

Doc ID: Mercury_7037258
correct_choice: predicted=1, reference=1
acc_raw: predicted=0, reference=0
acc_per_token: predicted=0, reference=0
acc_per_char: predicted=0, reference=0
- sum_logits: pred=-41.32, ref=-41.32
- num_tokens: pred=7.00, ref=7.00
- num_chars: pred=51.00, ref=51.00
- sum_logits: pred=-39.29, ref=-39.29
- num_tokens: pred=9.00, ref=9.00
- num_chars: pred=58.00, ref=58.00
- sum_logits: pred=-30.94, ref=-30.94
- num_tokens: pred=8.00, ref=8.00
- num_chars: pred=57.00, ref=57.00
- sum_logits: pred=-44.94, ref=-44.94
- num_tokens: pred=10.00, ref=10.00
- num_chars: pred=64.00, ref=64.00
----------------------------------------
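
To make the output above easier to read, here is a small sketch of how the three accuracy flags appear to follow from the per-choice values. It assumes that the four `sum_logits` / `num_tokens` / `num_chars` groups are listed in choice order starting at index 0, that `correct_choice` is the index of the gold answer, and that each metric picks the choice with the highest (optionally length-normalized) `sum_logits`. This reading is consistent with all five documents shown, but the `accuracy_metrics` helper is hypothetical and the script may handle ties or other details differently.

```python
def accuracy_metrics(choices, correct_choice):
    """Derive acc_raw, acc_per_token, and acc_per_char from per-choice values."""

    def argmax(scores):
        # Index of the highest score; the first index wins on ties.
        return max(range(len(scores)), key=scores.__getitem__)

    raw = [c["sum_logits"] for c in choices]
    per_token = [c["sum_logits"] / c["num_tokens"] for c in choices]
    per_char = [c["sum_logits"] / c["num_chars"] for c in choices]
    return {
        "acc_raw": int(argmax(raw) == correct_choice),
        "acc_per_token": int(argmax(per_token) == correct_choice),
        "acc_per_char": int(argmax(per_char) == correct_choice),
    }


# Values copied from the Mercury_417466 entry in the output above.
choices = [
    {"sum_logits": -33.75, "num_tokens": 12, "num_chars": 60},
    {"sum_logits": -36.50, "num_tokens": 11, "num_chars": 55},
    {"sum_logits": -29.18, "num_tokens": 9, "num_chars": 51},
    {"sum_logits": -38.65, "num_tokens": 8, "num_chars": 44},
]
print(accuracy_metrics(choices, correct_choice=0))
# prints {'acc_raw': 0, 'acc_per_token': 1, 'acc_per_char': 1}
```

For this document the raw score favors the third choice (-29.18), while dividing by token or character count favors the first, which matches the reported `acc_raw=0`, `acc_per_token=1`, `acc_per_char=1`.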

@IanMagnusson, I would appreciate a review when you have a chance!
