[codex] add agent eval failure packet #10

Merged: yangfei222666-9 merged 2 commits into main from codex/agent-eval-data-packet on Apr 29, 2026
@yangfei222666-9 (Owner)

Summary

Adds a concrete, machine-checkable agent eval data packet to self-improving-loop:

  • docs/ANNOTATION_GUIDELINE.md defines failure labels, hard vs soft signals, and routing policy.
  • docs/AI_CODING_TOOL_FAILURE_NOTES.md maps Claude Code / Cursor / OpenClaw / Hermes failure modes into labels and guard actions.
  • examples/agent_eval_cases.jsonl adds 30 non-authorizing eval cases.
  • examples/verify_agent_eval_cases.py verifies schema, labels, unique case IDs, and hard no-execution flags.
  • tests/test_agent_eval_cases.py locks the packet boundary in CI.

Why

This extends the project beyond a regression-rollback runtime into a reliability data asset: traces can now be mapped to failure labels, eval cases, and conservative routing without granting execution authority.

Safety boundary

The packet is eval data only. The verifier enforces:

judgment_allowed=false
paper_buy_allowed=false
trade_allowed=false
promote_allowed=false
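
For illustration, the flag check and the unique-ID check could look roughly like the sketch below. This is not the code in examples/verify_agent_eval_cases.py: the function name `verify_cases` and the `case_id` field name are assumptions made for the example, and only the four flag names come from this PR.

```python
import json

# Flag names taken from the safety boundary above; each must be exactly false.
NO_EXECUTION_FLAGS = (
    "judgment_allowed",
    "paper_buy_allowed",
    "trade_allowed",
    "promote_allowed",
)


def verify_cases(lines):
    """Return a list of error strings for an iterable of JSONL lines.

    Hypothetical helper: assumes each case is a JSON object with a string
    "case_id" field (an assumed name, not confirmed by the PR).
    """
    errors = []
    seen_ids = set()
    for lineno, raw in enumerate(lines, start=1):
        raw = raw.strip()
        if not raw:
            continue  # tolerate blank lines
        try:
            case = json.loads(raw)
        except json.JSONDecodeError as exc:
            errors.append(f"line {lineno}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(case, dict):
            errors.append(f"line {lineno}: case is not a JSON object")
            continue

        case_id = case.get("case_id")  # assumed field name
        if not isinstance(case_id, str) or not case_id:
            errors.append(f"line {lineno}: missing or non-string case_id")
        elif case_id in seen_ids:
            errors.append(f"line {lineno}: duplicate case_id {case_id!r}")
        else:
            seen_ids.add(case_id)

        # A missing flag is rejected too: only an explicit false passes.
        for flag in NO_EXECUTION_FLAGS:
            if case.get(flag) is not False:
                errors.append(f"line {lineno}: {flag} must be false")
    return errors
```

Requiring an explicit `false` rather than treating an absent flag as safe keeps the check conservative: a case that omits a flag fails verification instead of passing by default.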

Validation

python3 examples/verify_agent_eval_cases.py examples/agent_eval_cases.jsonl
python3 -m py_compile examples/verify_agent_eval_cases.py
python3 -m pytest -q tests/test_agent_eval_cases.py tests/test_examples.py
python3 -m pytest -q

Results:

verdict=ok
case_count=30
10 passed
66 passed

@yangfei222666-9 marked this pull request as ready for review April 29, 2026 12:25
@yangfei222666-9 force-pushed the codex/agent-eval-data-packet branch from 5089400 to b6b70a3 April 29, 2026 12:29
@yangfei222666-9 merged commit bedb1b3 into main Apr 29, 2026
12 checks passed
@yangfei222666-9 deleted the codex/agent-eval-data-packet branch April 29, 2026 12:32