Add BABILong results for Aegyx 0.1 by Voresot · Pull Request #17 · booydar/babilong

Voresot · 2026-05-28T17:33:37Z

Summary

This PR adds BABILong evaluation results for Aegyx 0.1, a closed research prototype of the Aegyx system.

Aegyx 0.1 is evaluated on the official public BABILong QA1-QA5 splits across the available context lengths from 0k to 10M.

Results

Using the unmodified BABILong official collector/scorer:

Tasks: qa1, qa2, qa3, qa4, qa5
Context lengths: 0k, 1k, 2k, 4k, 8k, 16k, 32k, 64k, 128k, 256k, 512k, 1M, 10M
Average accuracy: 100% at every listed context length
10M average accuracy: 100%
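
For readers unfamiliar with how a per-length average is derived from prediction CSVs, here is a minimal sketch. It is not the official BABILong collector/scorer: the column names (`length`, `target`, `output`), the substring-match rule, and the sample rows are all assumptions made up for illustration, and they are not data from this submission.

```python
import csv
import io
from collections import defaultdict


def accuracy_by_length(rows):
    """Return {context_length: accuracy in percent}.

    `rows` is an iterable of dicts with the hypothetical keys
    'length', 'target', and 'output'.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for row in rows:
        totals[row["length"]] += 1
        # Simplified rule (an assumption): count a hit when the target
        # string appears in the model output, ignoring case.
        if row["target"].strip().lower() in row["output"].strip().lower():
            hits[row["length"]] += 1
    return {length: 100.0 * hits[length] / totals[length] for length in totals}


# Illustrative rows only; not taken from the submitted files.
sample = io.StringIO(
    "length,target,output\n"
    "0k,bathroom,The answer is bathroom.\n"
    "0k,kitchen,garden\n"
    "1k,hallway,Hallway\n"
)
scores = accuracy_by_length(csv.DictReader(sample))
print(scores)
```

A reported figure of 100% at a given length means every row at that length counted as a hit under the scorer's matching rule.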

Evaluation

The submitted files include:

raw prediction CSVs in the standard BABILong eval format
generated result table for Aegyx 0.1
generated result plot/PDF
scoring manifest with repository commit and scorer hashes

The official BABILong scorer was not modified.
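
As a rough illustration of how scorer hashes in a manifest could be checked, the sketch below recomputes SHA-256 digests and compares them with recorded values. The manifest layout (`{"files": {relative_path: hex_digest}}`) and the function names are hypothetical; the actual structure of `Aegyx 0.1_manifest.json` is not described in this PR.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(manifest, root):
    """Return the relative paths whose current hash differs from the manifest.

    `manifest` uses a hypothetical layout: {"files": {relative_path: hex_digest}}.
    """
    mismatched = []
    for rel_path, expected in manifest["files"].items():
        if sha256_of(Path(root) / rel_path) != expected:
            mismatched.append(rel_path)
    return mismatched


# Demo on a throwaway file standing in for a scorer script.
with tempfile.TemporaryDirectory() as root:
    scorer = Path(root) / "scorer.py"
    scorer.write_text("print('score')\n")
    manifest = {"files": {"scorer.py": sha256_of(scorer)}}
    clean = verify_manifest(manifest, root)      # no mismatches expected
    scorer.write_text("print('tampered')\n")
    tampered = verify_manifest(manifest, root)   # the edited file is flagged
```

A matching hash shows only that the scorer file is byte-identical to the recorded version. It says nothing about how the predictions themselves were produced.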

Model / System Note

Aegyx 0.1 is a closed research prototype. This submission reports the evaluated long-context behavior only. Implementation details are not disclosed in this submission.

Files

babilong_evals/aegyx/Aegyx 0.1/*.csv
babilong_results/Aegyx 0.1.csv
babilong_results/Aegyx 0.1.pdf
babilong_results/Aegyx 0.1_heatmap.pdf
babilong_results/Aegyx 0.1_manifest.json

booydar · 2026-06-01T17:51:02Z

Thanks for submitting your evaluation results! Could you please share access to the model and the evaluation code so that we can reproduce the results? @Voresot

Voresot · 2026-06-02T19:41:33Z

Thanks for checking.

Aegyx 0.1 is a closed research prototype, so we cannot publicly release model weights or proprietary implementation code.

The submitted predictions were generated by Aegyx 0.1 inference and scored with the unmodified official BABILong collector/scorer. We can provide the raw prediction CSVs, result tables, scorer hashes, run manifests, and artifact hashes for the submitted evaluation.

For independent verification, we can support a black-box evaluation. You can provide a hidden/random BABILong subset or selected official rows, and we will run Aegyx 0.1 on them and return the raw predictions together with the corresponding hashes/manifests.

If needed, we can also discuss a limited private verification session without disclosing the closed prototype internals.
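
One way a reviewer might draw the hidden/random subset mentioned above is sketched here, assuming rows are addressed by integer index. The function name, the seed handling, and the dataset size are hypothetical and not part of any agreed protocol.

```python
import random


def draw_subset(num_rows, sample_size, seed):
    """Return a sorted list of row indices chosen with a private seed.

    The reviewer keeps `seed` secret until predictions are returned, then
    reveals it so that anyone can re-derive the same indices.
    """
    rng = random.Random(seed)
    return sorted(rng.sample(range(num_rows), sample_size))


# Hypothetical sizes for illustration only.
first = draw_subset(num_rows=1000, sample_size=10, seed=12345)
second = draw_subset(num_rows=1000, sample_size=10, seed=12345)
```

Because the draw is deterministic for a given seed, both parties can later confirm that the evaluated rows were exactly the ones selected.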

Voresot added 2 commits May 28, 2026 20:32

Add BABILong results for Aegyx 0.1

e63f066

Add BABILong scorer hashes for Aegyx 0.1

febaab4

Voresot wants to merge 2 commits into booydar:feat/babilong_evals_hf from Voresot:codex/add-aegyx-0-1-babilong-results
