fix(m3): repair harness bugs that artificially zeroed CUGA M3 pass rate #3
Merged
Commits
23 commits
c0ce9f1 fix(m3): fix harness bugs that artificially zeroed CUGA M3 pass rate (haroldship)
58aabfe fix(m3): mark P-OF-2 frontmatter as disabled to match filename (haroldship)
455315e Merge branch 'main' into fix/m3-harness-bugs (haroldship)
329b198 fix(m3): use DYNACONF_SERVER_PORTS__REGISTRY for registry bind and agent (haroldship)
ab5ccc4 fix(m3): auto-sequence capability passes when --m3-data has no --capa… (haroldship)
3b0276c Fix M3 bundle assembly and eval harness reliability. (haroldship)
de8ddd4 Fix create_eval_bundle import error when run as a script. (haroldship)
4947947 Fix bundle CLI when invoked outside the benchmarks package. (haroldship)
a300258 Fix CI failures from polluted eval env and bandit B108. (haroldship)
2eb12dd fix(m3): one eval run = one result file + one trajectory run (all tasks) (haroldship)
1852ea5 fix(m3): make sequential per-domain registry restarts reliable on the… (haroldship)
cf4303e fix(m3): add capability/domain/task# report columns + per-run bundle … (haroldship)
f4505c0 fix(m3): tune eval env and add defensive tool-output instructions (haroldship)
2ce23f5 Merge remote-tracking branch 'origin/main' into fix/m3-harness-bugs (haroldship)
ab3b4ef fix(m3): single Langfuse trace per task on Watsonx/Cuga path (haroldship)
041af8b fix(m3): gate Langfuse on settings and harden eval invoke fallbacks (haroldship)
671046c Merge remote-tracking branch 'origin/main' into fix/m3-harness-bugs (haroldship)
f54f4a5 fix(m3): export should_trace_langfuse_task from benchmarks.helpers (haroldship)
7676764 fix(m3): wire --no-policies through compare and eval.sh (haroldship)
790518c fix: address CodeRabbit review findings on PR #3 (haroldship)
4049fb0 fix: address Sergey's and Offer's review findings on PR #3 (haroldship)
eeea391 fix(m3): propagate --capability filter to react agent's registry expa… (haroldship)
acb0031 chore(m3): regenerate compiled policies.json from markdown sources (haroldship)
The code detects whether log files are organized as "one list per run" or "one flat list" by checking whether the first element is itself a list. If someone accidentally passes an empty nested structure like [[]], the code won't crash immediately but might behave unexpectedly: it will try to copy logs from an empty group, which could silently fail or do nothing.
Checked this. `[[]]` doesn't actually misbehave: `_copy_logs` early-returns `False` on a falsy/empty `log_files` list (line 250: `if not log_files: return False`), so the inner empty group is just a no-op: no silent partial copy, no crash. The shape-detection branch correctly routes it to the grouped path and then does nothing for the empty group. Leaving as-is; the existing guard already covers this.
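
For readers without the diff open, here is a minimal sketch of the behavior discussed in this thread. It is not the PR's actual code: only `_copy_logs`, `log_files`, and the `if not log_files: return False` guard come from the comment above. The caller `copy_run_logs`, the `dest_dir`/`dest_root` parameters, and the `run_<i>` directory layout are hypothetical, chosen only to make the example runnable.

```python
import shutil
import tempfile
from pathlib import Path


def _copy_logs(log_files, dest_dir):
    """Copy one group of log files; return False when there is nothing to copy."""
    if not log_files:  # guard quoted in the reply: an empty or None group is a no-op
        return False
    dest_dir.mkdir(parents=True, exist_ok=True)
    for src in log_files:
        shutil.copy2(src, dest_dir / Path(src).name)
    return True


def copy_run_logs(log_files, dest_root):
    """Hypothetical caller: detect grouped vs. flat input, then copy each group."""
    # Shape detection as described in the review: if the first element is a
    # list, treat the input as "one list per run"; otherwise as one flat list.
    grouped = bool(log_files) and isinstance(log_files[0], list)
    groups = log_files if grouped else [log_files]
    # Assumed layout: each group is copied into its own run_<i> directory.
    return [
        _copy_logs(group, Path(dest_root) / f"run_{i}")
        for i, group in enumerate(groups)
    ]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # [[]] takes the grouped path; the single empty group hits the guard,
        # so no directory is created and nothing is copied.
        print(copy_run_logs([[]], tmp))
```

Under these assumptions, `[[]]` is routed to the grouped branch and the guard turns the empty group into a no-op, which matches the conclusion in the reply.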