feat: add TokenUsageTracker callback for SDK mode by sami-marreed · Pull Request #5 · cuga-project/cuga-eval

sami-marreed · 2026-05-26T09:36:32Z

Related to cuga-agent#71

Note: the original description cited Fixes #37, which refers to an internal-tracker issue that was not migrated to this repo. Reference removed.

Summary

Adds TokenUsageTracker-like functionality to SDK mode (CugaAgent) to standardize trajectory capture across all benchmarks. This enables SDK mode to produce trajectory files with the same prompt richness as AgentRunner mode.

Changes

Core Implementation

New callback handler (benchmarks/helpers/token_usage_tracker_callback.py)
- LangChain callback that captures system/user/assistant prompts
- Mimics TokenUsageTracker behavior from cuga-agent's agent_loop.py
- Forwards captured data to ActivityTracker via callbacks
Updated SDK helper (benchmarks/helpers/sdk_eval_helpers.py)
- setup_agent_with_tools now automatically includes TokenUsageTracker callback
- New parameter: enable_token_usage_tracker (default: True)
- Backward compatible - can be disabled if needed
M3 evaluations updated to include callback:
- benchmarks/m3/eval_m3.py
- benchmarks/m3/eval_m3_task_1_enterprise_style.py

Security Updates

Upgraded langchain-core from 1.3.0 to 1.3.3 (fixes CVE-2026-44843)
Upgraded python-multipart from 0.0.26 to 0.0.28 (fixes CVE-2026-42561)

Impact

Before

SDK mode trajectories had:

Empty prompts fields on most steps
Minimal step granularity (5 steps)
Token tracking via Langfuse only

After

SDK mode trajectories now have:

Full LLM conversation history in prompts fields
System prompts, user prompts, and assistant responses for every LLM call
Same prompt richness as AgentRunner mode
Compatible with cuga-viz and trajectory analysis tools

Affected Benchmarks

All 4 benchmarks now capture rich trajectory data:

BPO - Automatic via setup_agent_with_tools()
M3 - Automatic + manual for direct CugaAgent creation
Oak Health Insurance - Automatic via setup_agent_with_tools()
AppWorld SDK - Automatic via setup_agent_with_tools()

Testing

just lint - All checks passed
just security - No vulnerabilities found
Pre-commit hooks passed

Documentation

See TOKENUSAGETRACKER_IMPLEMENTATION.md for complete implementation details, usage examples, and future work.

Notes

This is a workaround until upstream cuga-agent#71 is implemented to add native TokenUsageTracker support to CugaAgent SDK.

Fixes #37 - Add SDKTokenUsageTrackerCallback for rich trajectory capture - Enable for all benchmarks (BPO, M3, Oak, AppWorld SDK) - SDK trajectories now include full LLM conversation prompts

- Upgrade langchain-core from 1.3.0 to 1.3.3 (fixes CVE-2026-44843) - Upgrade python-multipart from 0.0.26 to 0.0.28 (fixes CVE-2026-42561) - All security checks now pass (just security ✅)

haroldship · 2026-06-03T19:17:58Z

Update

Merged latest main into feature/add-token-usage-tracker-to-sdk (resolved setup_agent_with_tools callback init conflict).
Local just ci: lint ✓, 266 tests ✓, security ✓.
Pushed to origin; CI should rerun shortly.

haroldship · 2026-06-03T19:45:32Z

Status: Merged main (e3f7438), conflict in setup_agent_with_tools resolved. All CI checks green. Mergeable.

Ready for review when you are.

haroldship · 2026-06-04T12:57:11Z

Closing without merge — superseded by upstream fix.

Upstream: cuga-agent#71 (closed) implemented in cuga-agent#236. CugaAgent now registers TokenUsageTracker internally; merging this PR would duplicate prompt/step capture.

Eval follow-up: Documentation tracked in #42 (CONTRIBUTING note + optional dep pins).

Issues: This PR had no linked cuga-eval issue (only cross-repo cuga-agent#71). Nothing else to close here.

haroldship added 4 commits May 26, 2026 12:32

feat: add TokenUsageTracker callback for SDK mode

a28d44a

Fixes #37 - Add SDKTokenUsageTrackerCallback for rich trajectory capture - Enable for all benchmarks (BPO, M3, Oak, AppWorld SDK) - SDK trajectories now include full LLM conversation prompts

chore(deps): update dependencies to fix security vulnerabilities

7065b50

- Upgrade langchain-core from 1.3.0 to 1.3.3 (fixes CVE-2026-44843) - Upgrade python-multipart from 0.0.26 to 0.0.28 (fixes CVE-2026-42561) - All security checks now pass (just security ✅)

Merge branch 'main' into feature/add-token-usage-tracker-to-sdk

6648c06

chore: merge main into feature/add-token-usage-tracker-to-sdk

e3f7438

haroldship mentioned this pull request Jun 4, 2026

[Chore]: Document cuga-agent SDK trajectory requirement (supersedes eval workaround PR #5) #42

Open

3 tasks

haroldship closed this Jun 4, 2026

haroldship deleted the feature/add-token-usage-tracker-to-sdk branch June 4, 2026 12:58

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: add TokenUsageTracker callback for SDK mode#5

feat: add TokenUsageTracker callback for SDK mode#5
sami-marreed wants to merge 4 commits into
mainfrom
feature/add-token-usage-tracker-to-sdk

sami-marreed commented May 26, 2026 •

edited by haroldship

Loading

Uh oh!

haroldship commented Jun 3, 2026

Uh oh!

haroldship commented Jun 3, 2026

Uh oh!

haroldship commented Jun 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

sami-marreed commented May 26, 2026 • edited by haroldship Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Changes

Core Implementation

Security Updates

Impact

Before

After

Affected Benchmarks

Testing

Documentation

Notes

Uh oh!

haroldship commented Jun 3, 2026

Update

Uh oh!

haroldship commented Jun 3, 2026

Uh oh!

haroldship commented Jun 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

sami-marreed commented May 26, 2026 •

edited by haroldship

Loading