Problem
Agent Arena has two observability layers that don't talk to each other:
- Game-side (TraceStore): Observations, tool results, scores, keyed by (agent_id, tick)
- LLM-side (LangSmith/Anthropic console): Prompts, model responses, token usage, latency
When debugging a bad decision at tick 42, a user has to manually correlate between these systems. There's no way to click on a tick and see the full chain: what the agent saw → what prompt was built → what the LLM returned → what tool was called → what happened in the game.
Proposed Solution
Add a trace bridge that links game ticks to framework trace IDs, creating a unified view.
How it works
- SDK passes tick context to the decide callback:
  The decide(observation) function already receives the tick via observation.tick. No change needed.
- Framework starters attach Arena metadata to LLM calls:

      # In starters/langchain/agent.py
      result = graph.invoke(
          {"observation": obs},
          config={"metadata": {"arena_tick": obs.tick, "arena_agent": obs.agent_id}},
      )

  LangSmith automatically indexes this metadata, making it searchable.
- TraceStore captures the framework trace URL back:

      # After the LLM call, store the link
      trace.add_step("framework_trace", {
          "langsmith_run_id": run_id,
          "langsmith_url": f"https://smith.langchain.com/runs/{run_id}",
      })
- Result: unified per-tick trace

      Tick 42:
        observation: {pos: [1,2,3], resources: [{name: "berry", dist: 3.2}]}
        framework_trace: https://smith.langchain.com/runs/abc123 ← click to see LLM details
        decision: {tool: "collect", params: {target: "berry"}}
        tool_result: {success: true, items_collected: 1}
        score: {resources_collected: 5}
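
The steps above can be sketched end to end without any framework installed. This is a minimal illustration, not the real TraceStore: `InMemoryTraceStore`, `arena_metadata`, `link_framework_trace`, and `render_tick` are hypothetical names invented here, and the run URL shape is simply the one used in this proposal.

```python
from __future__ import annotations

from collections import defaultdict
from typing import Any

# URL shape copied from the proposal above; the real LangSmith URL may differ.
LANGSMITH_RUN_URL = "https://smith.langchain.com/runs/{run_id}"


def arena_metadata(agent_id: str, tick: int) -> dict[str, Any]:
    """Build the metadata a starter would pass as config={"metadata": ...}."""
    return {"arena_tick": tick, "arena_agent": agent_id}


class InMemoryTraceStore:
    """Hypothetical stand-in for TraceStore, keyed by (agent_id, tick)."""

    def __init__(self) -> None:
        self._steps: dict[tuple[str, int], list[tuple[str, Any]]] = defaultdict(list)

    def add_step(self, agent_id: str, tick: int, name: str, payload: Any) -> None:
        # Steps are kept in insertion order so the trace reads chronologically.
        self._steps[(agent_id, tick)].append((name, payload))

    def link_framework_trace(self, agent_id: str, tick: int, run_id: str) -> None:
        # Store the framework run ID and a clickable URL alongside game-side steps.
        self.add_step(agent_id, tick, "framework_trace", {
            "langsmith_run_id": run_id,
            "langsmith_url": LANGSMITH_RUN_URL.format(run_id=run_id),
        })

    def render_tick(self, agent_id: str, tick: int) -> str:
        """Render one tick as the unified text view shown above."""
        lines = [f"Tick {tick}:"]
        for name, payload in self._steps.get((agent_id, tick), []):
            if name == "framework_trace":
                payload = payload["langsmith_url"]
            lines.append(f"  {name}: {payload}")
        return "\n".join(lines)


store = InMemoryTraceStore()
store.add_step("agent-1", 42, "observation", {"pos": [1, 2, 3]})
store.link_framework_trace("agent-1", 42, "abc123")
store.add_step("agent-1", 42, "decision", {"tool": "collect", "params": {"target": "berry"}})
print(store.render_tick("agent-1", 42))
```

Keeping the framework link as just another step means the game-side schema does not need a dedicated column per framework.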
Framework-agnostic design
The bridge should work with any framework:
- LangGraph: LangSmith run metadata + callbacks
- Claude SDK: Anthropic console trace IDs
- OpenAI SDK: OpenAI dashboard request IDs
- Custom: Any string URL/ID the user wants to attach
The SDK provides a simple hook:
    def decide(observation: Observation) -> Decision:
        # User's framework code here; it produces `decision` and, optionally,
        # a trace URL/ID such as `langsmith_url`.
        observation.trace_metadata["framework_url"] = langsmith_url
        return decision
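
One way the SDK side of this hook could look is sketched below. It assumes `trace_metadata` is a plain dict on the observation that the SDK reads after `decide` returns. The `Observation` dataclass, the `Decision` alias, and `run_tick` are hypothetical stand-ins for illustration, not the real SDK types.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any, Callable, Dict


@dataclass
class Observation:
    """Hypothetical minimal Observation; the real SDK class carries more fields."""
    agent_id: str
    tick: int
    # Free-form strings the user attaches during decide(): any URL or ID.
    trace_metadata: Dict[str, str] = field(default_factory=dict)


Decision = Dict[str, Any]  # placeholder for the SDK's Decision type


def run_tick(decide: Callable[[Observation], Decision], observation: Observation) -> Dict[str, Any]:
    """Call the user's decide() and fold any attached trace metadata into the record."""
    decision = decide(observation)
    record: Dict[str, Any] = {
        "agent_id": observation.agent_id,
        "tick": observation.tick,
        "decision": decision,
    }
    if observation.trace_metadata:
        # Copy so later mutation of the observation cannot alter the stored record.
        record["framework_trace"] = dict(observation.trace_metadata)
    return record


def decide(observation: Observation) -> Decision:
    # A real agent would call its framework here and capture the run URL.
    observation.trace_metadata["framework_url"] = "https://smith.langchain.com/runs/abc123"
    return {"tool": "collect", "params": {"target": "berry"}}


record = run_tick(decide, Observation(agent_id="agent-1", tick=42))
print(record["framework_trace"]["framework_url"])
```

Because the hook only stores strings, the same path would serve LangSmith URLs, Anthropic or OpenAI request IDs, or any custom identifier.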
Acceptance Criteria
Dependencies
Estimated Effort
1 day (after #74 is complete)