Add BABILong results for Aegyx 0.1 #17
Thanks for submitting your evaluation results! Could you please share access to the model and the evaluation code so that the results can be reproduced? @Voresot
Thanks for checking. Aegyx 0.1 is a closed research prototype, so we cannot publicly release model weights or proprietary implementation code. The submitted predictions were generated by Aegyx 0.1 inference and scored with the unmodified official BABILong collector/scorer.
We can provide the raw prediction CSVs, result tables, scorer hashes, run manifests, and artifact hashes for the submitted evaluation.
For independent verification, we can support a black-box evaluation. You can provide a hidden/random BABILong subset or selected official rows, and we will run Aegyx 0.1 on them and return the raw predictions together with the corresponding hashes/manifests. If needed, we can also discuss a limited private verification session without disclosing the closed prototype internals.
Summary
This PR adds BABILong evaluation results for Aegyx 0.1, a closed research prototype of the Aegyx system.
Aegyx 0.1 is evaluated on the official public BABILong QA1-QA5 splits across the available context lengths from 0k to 10M.
Results
Using the unmodified BABILong official collector/scorer:
Tasks: qa1, qa2, qa3, qa4, qa5
Context lengths: 0k, 1k, 2k, 4k, 8k, 16k, 32k, 64k, 128k, 256k, 512k, 1M, 10M
Accuracy: 100% at every listed context length
Overall: 100%
Evaluation
The submitted files include:
The official BABILong scorer was not modified.
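For readers who want a quick sanity check on a single prediction CSV, the following is a minimal sketch of recomputing accuracy. It is not the official BABILong scorer: it uses a simplified case-insensitive containment check, and the column names `target` and `output` are assumptions about the CSV layout rather than something stated in this PR.

```python
import csv
import io


def accuracy_from_csv(handle, target_col="target", output_col="output"):
    """Return the fraction of rows whose output contains the target answer.

    Simplified case-insensitive containment check; NOT the official scorer.
    The column names are assumed, not taken from the submitted files.
    """
    rows = list(csv.DictReader(handle))
    if not rows:
        raise ValueError("no prediction rows found")
    hits = sum(
        row[target_col].strip().lower() in row[output_col].strip().lower()
        for row in rows
    )
    return hits / len(rows)


# Hypothetical in-memory data standing in for one prediction CSV.
sample = io.StringIO(
    "target,output\n"
    "kitchen,The kitchen.\n"
    "garden,bathroom\n"
)
print(accuracy_from_csv(sample))
```

Any number produced this way should be treated as a rough cross-check only; the reported figures come from the official collector/scorer.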
Model / System Note
Aegyx 0.1 is a closed research prototype. This submission reports the evaluated long-context behavior only. Implementation details are not disclosed in this submission.
Files
babilong_evals/aegyx/Aegyx 0.1/*.csv
babilong_results/Aegyx 0.1.csv
babilong_results/Aegyx 0.1.pdf
babilong_results/Aegyx 0.1_heatmap.pdf
babilong_results/Aegyx 0.1_manifest.json
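Since the submission includes a manifest and the conversation mentions artifact hashes, a reviewer could check the files against it along these lines. This is a minimal sketch under stated assumptions: the manifest schema is not described in this PR, so the `{"artifacts": {relative_path: sha256_hex}}` layout and the helper names `sha256_of` and `verify_manifest` are hypothetical.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(manifest_path, root="."):
    """Return the relative paths whose file is missing or whose hash differs.

    Assumes a hypothetical manifest layout:
    {"artifacts": {"relative/path.csv": "<sha256 hex digest>", ...}}
    """
    manifest = json.loads(Path(manifest_path).read_text(encoding="utf-8"))
    mismatched = []
    for rel_path, expected in manifest["artifacts"].items():
        file_path = Path(root) / rel_path
        if not file_path.is_file() or sha256_of(file_path) != expected:
            mismatched.append(rel_path)
    return mismatched
```

An empty list from `verify_manifest` would mean every listed artifact is present and matches its recorded digest. This confirms file integrity only, not how the predictions were generated.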