Hi, and thanks for open-sourcing the ARC AGI benchmarking work.
It is one of the few efforts that seriously tries
to measure general intelligence rather than just adding “more benchmarks”.
I am coming from a slightly different angle
and wanted to ask for your opinion.
I maintain an open source framework called WFGY,
and the recent version “WFGY 3.0 · Singularity Demo”
is a pure TXT pack meant as a long-horizon stress test.
It is a set of 131 S class open problems
(math, physics, alignment, social systems and more)
encoded as a BlackHole-style test file that any LLM can read.
The goal is not to claim that the model “solves” these problems.
The goal is to observe how the model behaves when it is forced to:
- carry a large conceptual load over many turns
- stay consistent under high semantic tension
- avoid collapsing into vague handwaving or contradictions
So the question I want to ask is:
Do you see value in a complementary AGI evaluation axis
that focuses on “long-horizon tension stability”
instead of clean single-step task accuracy?
In other words,
if ARC-style tasks test pattern completion under strict constraints,
this TXT pack tests whether the reasoning structure itself
stays coherent in a hostile environment of very hard problems.
I am not asking you to endorse WFGY.
I am only trying to see
whether this idea of a public, text-only tension crash test
fits into how you think about AGI evaluation in the long term.
If it sounds worth a closer look,
I am happy to share more concrete examples
of how we drive the TXT pack in practice
and how we try to summarize the failure patterns.
Thanks again for your time and for pushing the field
toward serious evaluation.