feat(replay): add Recorder for capturing agent runs to JSON #57

Open
Islam Assanov (islamborghini) wants to merge 12 commits into dedalus-labs:main from islamborghini:feat/replay-recorder

Conversation


@islamborghini Islam Assanov (islamborghini) commented May 13, 2026

Summary

What:

Adds dedalus_labs.lib.replay, a new module that ships in two parts:

  1. Recorder captures every model request, model response, and tool result from a DedalusRunner run to a local versioned JSON trace file. Also wires up the existing-but-inactive on_tool_event runner parameter so it actually fires, and adds a parallel on_model_event parameter.

  2. Replayer reads that trace back and deterministically re-runs the agent through the production DedalusRunner with zero network traffic. Reuses the runner unchanged - only two seams are intercepted:

    • A _FakeClient that pops recorded ChatCompletion objects in order.
    • Synthetic tool callables that pop recorded tool_end results by name.

Recording supports opt-in redaction via composable redactor functions. Three are shipped out of the box (redact_emails, redact_bearer_tokens, redact_api_keys) so sensitive data can be scrubbed before a trace file leaves the machine.
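To make "composable redactor functions" concrete, here is a hedged sketch of the pattern. The function names, regular expressions, placeholder strings, and the compose() helper are assumptions for illustration; the shipped redact_emails and redact_bearer_tokens may match and replace differently.

```python
import re
from typing import Callable

Redactor = Callable[[str], str]

# Illustrative patterns only; real redactors may be stricter or looser.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
BEARER_RE = re.compile(r"Bearer\s+[A-Za-z0-9._~+/=-]+")


def sketch_redact_emails(text: str) -> str:
    return EMAIL_RE.sub("[REDACTED_EMAIL]", text)


def sketch_redact_bearer(text: str) -> str:
    return BEARER_RE.sub("Bearer [REDACTED]", text)


def compose(*redactors: Redactor) -> Redactor:
    """Chain redactors left to right into a single redactor."""

    def run(text: str) -> str:
        for redactor in redactors:
            text = redactor(text)
        return text

    return run


scrub = compose(sketch_redact_emails, sketch_redact_bearer)
cleaned = scrub("mail bob@example.com with Authorization: Bearer abc.def-123")
```

Each redactor is a plain string-to-string function, so adding a custom rule only means writing one more function and passing it to the chain.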

Replay has two escape hatches that let an engineer modify the recorded run while keeping everything else identical:

  • swap_tool={name: callable} substitutes a real Python function for one tool. Useful for A/B-testing a fix against the recorded conversation.
  • swap_client=Dedalus() substitutes a real client. Useful for re-running the recorded messages and tools against a live model.

Drift detection: if the runner asks for more model responses than recorded, or calls a tool more times than the trace shows, replay raises a RuntimeError pointing the user at the right swap_* argument.
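The tool seam, swap_tool, and drift detection can be sketched together. This is an illustrative toy under stated assumptions: build_tools(), the event field names (type, name, result), and the error text are hypothetical and do not come from the PR's Replayer.

```python
from collections import defaultdict, deque


def build_tools(events, swap_tool=None):
    """Return {tool_name: callable} that replays recorded results per tool."""
    swap_tool = swap_tool or {}
    queues = defaultdict(deque)
    for event in events:
        if event["type"] == "tool_end":
            queues[event["name"]].append(event["result"])

    def make(name):
        if name in swap_tool:
            # A swapped tool runs real code instead of replaying results.
            return swap_tool[name]

        def replayed(*args, **kwargs):
            if not queues[name]:
                # Drift: the run asked for more calls than the trace holds.
                raise RuntimeError(
                    f"tool {name!r} called more times than recorded; "
                    "consider passing swap_tool for it"
                )
            return queues[name].popleft()

        return replayed

    names = set(queues) | set(swap_tool)
    return {name: make(name) for name in names}


events = [
    {"type": "tool_end", "name": "add", "result": 4},
    {"type": "tool_end", "name": "add", "result": 9},
]
tools = build_tools(events)
first = tools["add"](2, 2)
second = tools["add"](4, 5)
try:
    tools["add"](1, 1)
    drift = None
except RuntimeError as exc:
    drift = str(exc)

swapped = build_tools(events, swap_tool={"add": lambda a, b: a + b})["add"](10, 5)
```

Replayed tools ignore their arguments and return what was recorded, while a swapped tool computes a fresh result from the arguments it receives.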

Why:

FDE feature.
Agent runs are non-deterministic. When a customer reports "the agent did X at 3pm," there is no way to reproduce it today - the workflow is screen-share and guesswork. With record/replay, the customer sends a trace.json, the engineer runs Replayer.from_file("trace.json").run() locally, and the bug is reproducible in seconds.
This is a Dedalus-tailored capture: the runner routes across 6+ providers and composes multiple MCP servers, so the trace captures which provider answered, which MCP server returned what, and the full tool argument and result history in one file.

Recording is opt-in and local-first. A run without the callbacks behaves exactly as before. Replay is zero-runner-change: it uses the production code path with two injected seams, so policy, message building, parallel scheduler, and MCP composition all run as in production.
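A rough sketch of the opt-in callback design, assuming a toy run function in place of DedalusRunner. SketchRecorder, toy_run(), FORMAT_VERSION, and the event fields are hypothetical names; the actual Recorder API and trace schema are described in docs/replay.md.

```python
import json

FORMAT_VERSION = 1  # hypothetical version number for the trace envelope


class SketchRecorder:
    """Toy recorder: collects events in order, serializes a versioned envelope."""

    def __init__(self):
        self.events = []

    def on_event(self, event):
        self.events.append(event)

    def to_json(self):
        return json.dumps({"format_version": FORMAT_VERSION, "events": self.events})


def toy_run(prompt, on_model_event=None):
    """Stand-in for a runner: the callback is optional and never affects the result."""
    if on_model_event is not None:
        on_model_event({"type": "model_request", "prompt": prompt})
    answer = prompt.upper()  # placeholder for a real model call
    if on_model_event is not None:
        on_model_event({"type": "model_response", "content": answer})
    return answer


plain = toy_run("hi")                              # no callback: nothing recorded
recorder = SketchRecorder()
recorded = toy_run("hi", on_model_event=recorder.on_event)
trace = json.loads(recorder.to_json())
```

Both calls return the same answer; the only difference is that the second leaves a trace behind.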

Lines added: ~1,100 total (impl ~400, tests ~370, docs ~220, examples ~150)

Test Plan

Automated:

  • uv run pytest tests/lib/test_replay_recorder.py tests/lib/test_replay_runner_integration.py tests/lib/test_replay_replayer.py -v - 20 tests covering event order, redaction composition, metadata roundtrip, context manager safety, redactor failure isolation, save idempotency, model event order, tool_end correlation, callback failure isolation, full recorder round-trip, format-version rejection, missing-events rejection, identity replay (with and without tool calls), swap_tool, swap_client, and drift detection.
  • uv run pytest tests/ --ignore=tests/api_resources - 537 passed, 1 skipped (0 regressions vs main).
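For readers unfamiliar with the format-version and missing-events rejection cases named above, the checks amount to something like the sketch below. The field names format_version and events, the supported-version set, and the use of ValueError are assumptions; the PR's loader may use different keys and exception types.

```python
import json

SUPPORTED_FORMAT_VERSIONS = {1}  # hypothetical set of readable versions


def load_trace(text):
    """Parse a trace and reject envelopes this reader cannot replay."""
    data = json.loads(text)
    version = data.get("format_version")
    if version not in SUPPORTED_FORMAT_VERSIONS:
        raise ValueError(f"unsupported trace format_version: {version!r}")
    events = data.get("events")
    if not isinstance(events, list) or not events:
        raise ValueError("trace has no events")
    return data


good = load_trace('{"format_version": 1, "events": [{"type": "model_request"}]}')

errors = []
for bad in ('{"format_version": 99, "events": [{}]}', '{"format_version": 1}'):
    try:
        load_trace(bad)
    except ValueError as exc:
        errors.append(str(exc))
```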

Live record-then-replay (requires API key for the record step, replay needs no network):

DEDALUS_API_KEY=<key> uv run python examples/replay/01_record.py
unset DEDALUS_API_KEY
uv run python examples/replay/02_replay.py trace.json

The replay step prints the same final answer as the live run. Running it with the key unset proves zero network traffic during replay.

Multi-tool, multi-model live demo:

DEDALUS_API_KEY=<key> uv run python examples/replay/03_multi_tool.py
uv run python examples/replay/02_replay.py trace_multi.json

Records a multi-step run across two models. The model emits a transfer_to_* handoff tool call that the runner does not currently execute client-side (the server-side handoff path is not wired into this SDK build). The recorder captures this faithfully, and replay reproduces the final state byte-for-byte, including the unresolved handoff.

Repro / Showcase

Three runnable examples in examples/replay/:

  • 01_record.py - 30-line tool-calling agent run recorded to trace.json.
  • 02_replay.py - 25-line replayer for any trace file.
  • 03_multi_tool.py - records a multi-tool, multi-step, multi-model run.

Can be kept or removed when merging.

Tests Added

  • Unit tests
  • Integration tests
  • E2E tests
  • N/A (no new code paths)

Documentation

  • Internal (docs/): docs/replay.md - privacy model, public API for Recorder, trace format reference, built-in redactors, Replayer API, swap semantics, drift detection, out-of-scope items.
  • External (apps/docs/): N/A

Notes for Reviewers

Runner changes are minimal. on_tool_event was already declared on _ExecutionConfig and on the run() signature in main but never emitted - this PR makes the declared hook fire and adds a parallel on_model_event. The runner diff is ~25 lines added, 0 changed. All other new logic lives in lib/replay/.

Replay has zero runner changes. The Replayer injects a fake client (not an AsyncDedalus instance, so the runner picks the sync execution path) and synthetic tool callables built from the recorded tool_end events. It walks the exact production loop. swap_tool callables are wrapped via a lambda so the runner's __name__-based dispatch finds them; the user's original function is not mutated.
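The wrapping trick for swap_tool can be illustrated in a few lines. This is a sketch of the general technique, assuming a dispatch table keyed on __name__; wrap_as(), fixed_lookup(), and the get_weather tool name are hypothetical.

```python
def wrap_as(name, fn):
    """Return a new callable named `name` that forwards to fn, leaving fn untouched."""
    wrapper = lambda *args, **kwargs: fn(*args, **kwargs)
    wrapper.__name__ = name  # only the wrapper is renamed
    return wrapper


def fixed_lookup(city):
    return f"weather for {city}"


wrapped = wrap_as("get_weather", fixed_lookup)

# A __name__-keyed dispatch table, as a runner might build one.
dispatch = {tool.__name__: tool for tool in [wrapped]}
result = dispatch["get_weather"]("Paris")
```

Renaming the wrapper rather than the user's function means the same function can be reused elsewhere under its original name.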

Streaming paths (_execute_streaming_async/sync) are intentionally not instrumented in this PR - documented in docs/replay.md under "Out of scope" along with cloud upload, OTel export, trace diffing, and schema migration tooling.


cursor Bot commented May 13, 2026

PR Summary

Medium Risk
Adds new observation callbacks and event emission in DedalusRunner, plus a new trace/replay subsystem; incorrect event payloads or tool-call correlation could affect debugging and (if misused) leak data, though credentials are explicitly omitted and callbacks are best-effort.

Overview
Introduces dedalus_labs.lib.replay to record agent runs to a versioned local JSON trace (Recorder, redactors, and trace envelope) and replay them deterministically via Replayer using a fake client and recorded tool results, with explicit drift detection and optional swap_tool/swap_client overrides.

Updates DedalusRunner to actually emit observation events: adds on_model_event and wires on_tool_event to fire tool_end events correlated by tool_call_id (including under concurrent tool execution), and emits model_request/model_response events with JSON-serializable payloads while dropping credentials from request events.
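Correlating tool_end events by tool_call_id under concurrent execution, with best-effort callbacks, might look like this sketch. It is not the runner's code: run_tool(), slow_add(), and the event fields are assumptions for illustration.

```python
import asyncio


async def run_tool(call, fn, on_tool_event):
    """Execute one tool call and emit a tool_end event tagged with its call id."""
    result = await fn(**call["arguments"])
    event = {
        "type": "tool_end",
        "tool_call_id": call["id"],
        "name": call["name"],
        "result": result,
    }
    try:
        on_tool_event(event)
    except Exception:
        pass  # best-effort: a failing callback must not break the run
    return result


async def slow_add(a, b, delay):
    await asyncio.sleep(delay)
    return a + b


async def main():
    events = []
    calls = [
        {"id": "call_1", "name": "slow_add", "arguments": {"a": 1, "b": 2, "delay": 0.02}},
        {"id": "call_2", "name": "slow_add", "arguments": {"a": 10, "b": 20, "delay": 0.0}},
    ]
    # Both calls run concurrently, so completion order may differ from call order.
    await asyncio.gather(*(run_tool(c, slow_add, events.append) for c in calls))
    return {e["tool_call_id"]: e["result"] for e in events}


by_id = asyncio.run(main())
```

Since each event carries its own tool_call_id, a consumer can match results to calls without relying on the order in which events arrive.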

Adds docs and runnable examples for record/replay, plus unit + integration tests covering trace shape, redaction behavior, callback error isolation, correlation correctness, and replay drift/override behavior.

Reviewed by Cursor Bugbot for commit 0bbaa1c. Bugbot is set up for automated code reviews on this repo.

@islamborghini (Author)

Redaction is opt-in by design. Silently modifying trace data by default would make the tool misleading for debugging (an email in a tool argument is legitimate data, not always sensitive). Users who need to scrub data before sharing a trace have three built-in redactors and a composable redact= hook; the privacy section of docs/replay.md covers this explicitly.

Comment thread src/dedalus_labs/lib/runner/core.py Outdated
Comment thread src/dedalus_labs/lib/replay/_replayer.py Outdated

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Bugbot Autofix is ON, but it could not run because the branch was deleted or merged before autofix could start.

Reviewed by Cursor Bugbot for commit 5bd65e0.

Comment thread src/dedalus_labs/lib/runner/core.py