feat: add TokenUsageTracker callback for SDK mode #5
Closed
sami-marreed wants to merge 4 commits into
Fixes #37
- Add SDKTokenUsageTrackerCallback for rich trajectory capture
- Enable for all benchmarks (BPO, M3, Oak, AppWorld SDK)
- SDK trajectories now include full LLM conversation prompts

- Upgrade langchain-core from 1.3.0 to 1.3.3 (fixes CVE-2026-44843)
- Upgrade python-multipart from 0.0.26 to 0.0.28 (fixes CVE-2026-42561)
- All security checks now pass (just security ✅)
Status: Merged
Ready for review when you are.
Closing without merge: superseded by upstream fix.
Upstream: cuga-agent#71 (closed), implemented in cuga-agent#236.
Eval follow-up: Documentation tracked in #42 (CONTRIBUTING note + optional dep pins).
Issues: This PR had no linked cuga-eval issue (only cross-repo cuga-agent#71). Nothing else to close here.
Related to cuga-agent#71
Summary
Adds TokenUsageTracker-like functionality to SDK mode (CugaAgent) to standardize trajectory capture across all benchmarks. This enables SDK mode to produce trajectory files with the same prompt richness as AgentRunner mode.
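To illustrate the idea, here is a minimal, framework-agnostic sketch of a recorder that keeps the prompts and token counts for each LLM call. It is not the SDKTokenUsageTrackerCallback from this PR: the class names (StepRecord, PromptUsageRecorder), the hook names (on_llm_start, on_llm_end), and the usage keys (input_tokens, output_tokens) are all hypothetical and only loosely modeled on how callback-style trackers tend to be shaped.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class StepRecord:
    """One LLM call: the prompts sent and the token counts reported back."""
    prompts: list[str]
    input_tokens: int = 0
    output_tokens: int = 0


@dataclass
class PromptUsageRecorder:
    """Hypothetical recorder; an assumption, not the PR's actual callback."""
    steps: list[StepRecord] = field(default_factory=list)

    def on_llm_start(self, prompts: list[str]) -> None:
        # Open a new step as soon as the prompts are known, so the
        # prompts are kept even if the call later fails.
        self.steps.append(StepRecord(prompts=list(prompts)))

    def on_llm_end(self, usage: dict[str, int]) -> None:
        # Attach token counts to the most recently opened step.
        if not self.steps:
            return
        step = self.steps[-1]
        step.input_tokens = usage.get("input_tokens", 0)
        step.output_tokens = usage.get("output_tokens", 0)

    def total_tokens(self) -> int:
        # Sum input and output tokens across every recorded step.
        return sum(s.input_tokens + s.output_tokens for s in self.steps)

    def to_trajectory(self) -> list[dict[str, Any]]:
        # Serialize steps into plain dicts, each carrying a "prompts" field.
        return [
            {
                "prompts": s.prompts,
                "input_tokens": s.input_tokens,
                "output_tokens": s.output_tokens,
            }
            for s in self.steps
        ]


# Usage: simulate one LLM call and inspect what was captured.
recorder = PromptUsageRecorder()
recorder.on_llm_start(["You are a helpful agent.", "List my open orders."])
recorder.on_llm_end({"input_tokens": 12, "output_tokens": 30})
trajectory = recorder.to_trajectory()
```

Recording the prompts at call start rather than at call end is a design choice in this sketch: each trajectory step gets its prompts field even when no usage data arrives.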
Changes
Core Implementation
New callback handler (benchmarks/helpers/token_usage_tracker_callback.py)
Updated SDK helper (benchmarks/helpers/sdk_eval_helpers.py)
M3 evaluations updated to include callback:
Security Updates
Impact
Before
SDK mode trajectories had:
prompts fields on most steps
After
SDK mode trajectories now have:
prompts fields
Affected Benchmarks
All 4 benchmarks now capture rich trajectory data:
setup_agent_with_tools()
setup_agent_with_tools()
setup_agent_with_tools()
Testing
just lint - All checks passed
just security - No vulnerabilities found
Documentation
See TOKENUSAGETRACKER_IMPLEMENTATION.md for complete implementation details, usage examples, and future work.
Notes
This is a workaround until upstream cuga-agent#71 is implemented to add native TokenUsageTracker support to CugaAgent SDK.