Add BABILong results for Aegyx 0.1 #17

Open
Voresot wants to merge 2 commits into booydar:feat/babilong_evals_hf from Voresot:codex/add-aegyx-0-1-babilong-results

Voresot commented May 28, 2026

Summary

This PR adds BABILong evaluation results for Aegyx 0.1, a closed research prototype of the Aegyx system.

Aegyx 0.1 is evaluated on the official public BABILong QA1-QA5 splits across the available context lengths from 0k to 10M.

Results

Scored with the unmodified official BABILong collector/scorer:

  • Tasks: qa1, qa2, qa3, qa4, qa5
  • Context lengths: 0k, 1k, 2k, 4k, 8k, 16k, 32k, 64k, 128k, 256k, 512k, 1M, 10M
  • Average accuracy: 100% at every listed context length
  • 10M average accuracy: 100%
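For readers checking the headline figure: the average at a given context length is the mean of the per-task accuracies at that length. The sketch below illustrates that arithmetic only. The nested-dict layout and the function name `average_by_length` are assumptions made for illustration, not the format of the submitted result table (which comes from the official collector), and the numbers are placeholders rather than submitted results.

```python
from statistics import mean


def average_by_length(acc):
    """Average per-task accuracy at each context length.

    `acc` maps task name -> {context length label: accuracy in percent}.
    This layout is hypothetical and is not the official result-table format.
    """
    # Collect context lengths in first-seen order.
    lengths = []
    for per_length in acc.values():
        for length in per_length:
            if length not in lengths:
                lengths.append(length)
    # Average over the tasks that report a given length.
    return {
        length: mean(task[length] for task in acc.values() if length in task)
        for length in lengths
    }


# Placeholder values for illustration only (not results from this PR).
example = {
    "qa1": {"0k": 100.0, "1k": 90.0},
    "qa2": {"0k": 80.0, "1k": 70.0},
}
print(average_by_length(example))
```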

Evaluation

The submitted files include:

  • raw prediction CSVs in the standard BABILong eval format
  • generated result table for Aegyx 0.1
  • generated result plot/PDF
  • scoring manifest with repository commit and scorer hashes

The official BABILong scorer was not modified.
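A scoring manifest of this kind could be produced roughly as follows. This is a minimal sketch, not the script used for this submission: the manifest keys (`repository_commit`, `sha256`) and the helper names are assumptions, and the actual `Aegyx 0.1_manifest.json` may be structured differently.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root, commit):
    """Hash every file under `root` and record the repository commit.

    The key names below are hypothetical, chosen for illustration.
    """
    root = Path(root)
    files = sorted(p for p in root.rglob("*") if p.is_file())
    return {
        "repository_commit": commit,
        "sha256": {
            p.relative_to(root).as_posix(): sha256_of_file(p) for p in files
        },
    }
```

A reviewer can re-run the same hashing over the scorer files and the prediction CSVs and compare digests. Matching hashes show that the files are byte-identical to the ones described; they do not show how the predictions were generated.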

Model / System Note

Aegyx 0.1 is a closed research prototype. This submission reports the evaluated long-context behavior only. Implementation details are not disclosed in this submission.

Files

  • babilong_evals/aegyx/Aegyx 0.1/*.csv
  • babilong_results/Aegyx 0.1.csv
  • babilong_results/Aegyx 0.1.pdf
  • babilong_results/Aegyx 0.1_heatmap.pdf
  • babilong_results/Aegyx 0.1_manifest.json

booydar (Owner) commented Jun 1, 2026

Thanks for submitting your evaluation results! Could you please share access to the model and the evaluation code so that we can reproduce the results? @Voresot

Voresot (Author) commented Jun 2, 2026

Thanks for checking.

Aegyx 0.1 is a closed research prototype, so we cannot publicly release model weights or proprietary implementation code.

The submitted predictions were generated by Aegyx 0.1 inference and scored with the unmodified official BABILong collector/scorer. We can provide the raw prediction CSVs, result tables, scorer hashes, run manifests, and artifact hashes for the submitted evaluation.

For independent verification, we can support a black-box evaluation. You can provide a hidden/random BABILong subset or selected official rows, and we will run Aegyx 0.1 on them and return the raw predictions together with the corresponding hashes/manifests.
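One way such a black-box check could be organized on the verifier's side is sketched below. It is an illustration under stated assumptions, not an agreed protocol: the verifier samples rows with a private seed, publishes a salted hash of the expected answers before sending the questions, and scores the returned predictions afterwards. All function names are hypothetical, and the exact-match comparison is a simplification that does not reproduce the official BABILong scorer.

```python
import hashlib
import json
import random


def sample_rows(row_ids, k, seed):
    """Pick k row ids reproducibly from a private seed."""
    rng = random.Random(seed)
    return sorted(rng.sample(list(row_ids), k))


def commit(targets, salt):
    """Salted SHA-256 commitment to the expected answers.

    `targets` maps row id (str) -> expected answer (str). Publishing the
    digest first lets both sides confirm later that targets were not changed.
    """
    payload = json.dumps(targets, sort_keys=True).encode("utf-8")
    return hashlib.sha256(salt.encode("utf-8") + payload).hexdigest()


def score(predictions, targets):
    """Fraction of rows whose prediction matches the target.

    Case-insensitive exact match after stripping whitespace; this is a
    simplification and NOT the official scorer's comparison rule.
    """
    hits = sum(
        predictions.get(row_id, "").strip().lower() == target.strip().lower()
        for row_id, target in targets.items()
    )
    return hits / len(targets)
```

In this arrangement the submitter sees only the questions for the sampled rows, while the verifier retains the targets and the salt and reveals them after the predictions are returned.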

If needed, we can also discuss a limited private verification session without disclosing the closed prototype internals.
