Skip to content

Question: how would you classify a long-horizon “tension” exam alongside ARC? #83

@onestardao

Description

@onestardao

Hi, first of all, huge respect for ARC and for the work you are doing here.
It is one of the few benchmarks that really forces models to show actual reasoning instead of pattern matching.

I am working on something that is not a direct competitor to ARC,
but feels like it lives in a neighboring corner of the space.

I built a pure-text exam called “WFGY 3.0 · Singularity Demo”.
It packages 131 S-class questions as one TXT pack that any LLM can read.
The idea is not to score a single puzzle.
Instead, it is to keep the model in a very high-tension conceptual field for many turns and see when its world model starts to drift, collapse, or contradict itself.

You can think of it as:

ARC: “can you solve this kind of abstraction and reasoning problem at all?”

WFGY 3.0 TXT: “after being exposed to a whole field of extreme problems,
can your internal story stay coherent or does it fall apart?”

My questions are:

From your perspective, is this kind of long-horizon, text-only “tension crash test”
something that should be considered part of AGI evaluation, or is it a separate exam class?

If someone wanted to relate results on this TXT exam to ARC,
what kind of evidence or protocol would you consider non-ridiculous?

Do you see any obvious reasons why this kind of stress test is a bad idea,
or things we should be very careful about before calling it “AGI relevant”?

For context, the TXT pack is open source in my main repo (WFGY, ~1.4k stars),
and has already been run by several LLMs as a self-contained exam.
I am not asking to “join the leaderboard”.
I am trying to understand how people who really care about ARC-style evaluation
would classify this type of test.

Happy to provide more details or traces if you are curious.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions