Question: long horizon tension stress tests as a complementary AGI evaluation axis #82

@onestardao

Hi, and thanks for open-sourcing the ARC AGI benchmarking work.
It is one of the few efforts that seriously tries
to measure general intelligence rather than just adding “more benchmarks”.

I am coming from a slightly different angle
and wanted to ask for your opinion.

I maintain an open-source framework called WFGY,
and the recent version, “WFGY 3.0 · Singularity Demo”,
is a pure TXT pack meant as a long-horizon stress test.
It is a set of 131 S-class open problems
(math, physics, alignment, social systems and more)
encoded as a BlackHole-style test file that any LLM can read.

The goal is not to claim that the model “solves” these problems.
The goal is to scan how the model behaves when it is forced to:

- carry a large conceptual load over many turns
- stay consistent under high semantic tension
- avoid collapse into vague handwaving or contradictions
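
To make these three checks a bit more concrete, here is a rough sketch of what a scan loop of this kind could look like. It is only an illustration under my own assumptions, not the actual WFGY driver: `ask_model`, `ScanResult`, the `probe` question, and the keyword-based vagueness check are all hypothetical placeholders, and a real run would need much better detectors.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

# Hypothetical markers of vague handwaving; a real detector would be far richer.
HEDGE_PHRASES = ("somehow", "in some sense", "it depends", "various factors")


@dataclass
class ScanResult:
    turns: int = 0
    vague: List[int] = field(default_factory=list)           # turn indices flagged as vague
    contradictions: List[int] = field(default_factory=list)  # turns where the probe answer drifted


def is_vague(answer: str) -> bool:
    """Crude keyword check standing in for a real vagueness detector."""
    text = answer.lower()
    return any(phrase in text for phrase in HEDGE_PHRASES)


def scan(ask_model: Callable[[List[str]], str],
         prompts: List[str],
         probe: str) -> ScanResult:
    """Drive a multi-turn session and record simple failure patterns.

    ask_model is a placeholder: it takes the full message history and
    returns the model's next reply as a string.
    """
    history: List[str] = []
    baseline: Optional[str] = None
    result = ScanResult()
    for turn, prompt in enumerate(prompts):
        # Carry the whole conversation forward so the load grows each turn.
        history.append(prompt)
        answer = ask_model(history)
        history.append(answer)
        result.turns += 1
        if is_vague(answer):
            result.vague.append(turn)
        # Re-ask the same probe off the record; a changed answer counts as drift.
        stance = ask_model(history + [probe]).strip().lower()
        if baseline is None:
            baseline = stance
        elif stance != baseline:
            result.contradictions.append(turn)
    return result
```

The probe is asked on a copy of the history, so it never pollutes the main conversation. Exact string comparison is obviously too strict for real model output; it is used here only to keep the sketch short.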

So the question I want to ask is:

Do you see value in a complementary AGI evaluation axis
that focuses on “long-horizon tension stability”
instead of clean single-step task accuracy?

In other words,
if ARC-style tasks test pattern completion under strict constraints,
this TXT pack tests whether the reasoning structure itself
stays coherent in a hostile environment of very hard problems.

I am not asking you to endorse WFGY.
I am only trying to see
whether this idea of a public, text-only tension crash test
fits into how you think about AGI evaluation in the long term.

If it sounds worth a closer look,
I am happy to share more concrete examples
of how we drive the TXT pack in practice
and how we try to summarize the failure patterns.

Thanks again for your time and for pushing the field
toward serious evaluation.
