Question: long-horizon tension stress tests as a complementary AGI evaluation axis

Hi, and thanks for open-sourcing the ARC-AGI benchmarking work.
It is one of the few efforts that seriously try
to measure general intelligence rather than just adding “more benchmarks”.

I am coming at this from a slightly different angle
and would like to ask for your opinion.

I maintain an open-source framework called WFGY,
and the recent version, “WFGY 3.0 · Singularity Demo”,
is a plain-text (TXT) pack intended as a long-horizon stress test.
It consists of 131 S-class open problems
(math, physics, alignment, social systems, and more)
encoded as a BlackHole-style test file that any LLM can read.

The goal is not to claim that the model “solves” these problems.
The goal is to observe how the model behaves when it is forced to:

- carry a large conceptual load over many turns
- stay consistent under high semantic tension
- avoid collapsing into vague hand-waving or contradictions

So my question is:

Do you see value in a complementary AGI evaluation axis
that focuses on “long-horizon tension stability”
alongside clean single-step task accuracy?

In other words,
if ARC-style tasks test pattern completion under strict constraints,
this TXT pack tests whether the reasoning structure itself
stays coherent in a hostile environment of very hard problems.

I am not asking you to endorse WFGY.
I am only trying to see
whether this idea of a public, text-only tension crash test
fits into how you think about AGI evaluation in the long term.

If it sounds worth a closer look,
I am happy to share more concrete examples
of how we drive the TXT pack in practice
and how we try to summarize the failure patterns.

Thanks again for your time and for pushing the field
toward serious evaluation.


Question: long horizon tension stress tests as a complementary AGI evaluation axis #82