feat(replay): add Recorder for capturing agent runs to JSON #57
Islam Assanov (islamborghini) wants to merge 12 commits
PR Summary: Medium Risk

Overview: Adds docs and runnable examples for record/replay, plus unit and integration tests covering trace shape, redaction behavior, callback error isolation, correlation correctness, and replay drift/override behavior.

Reviewed by Cursor Bugbot for commit 0bbaa1c.
Redaction is opt-in by design. Silently modifying trace data by default would make the tool misleading for debugging (an email in a tool argument is legitimate data, not always sensitive). Users who need to scrub data before sharing a trace have three built-in redactors and a composable `redact=` hook; the privacy section of `docs/replay.md` covers this explicitly.
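To illustrate what "composable" can mean here, the sketch below chains string-to-string redactors into a single callable. The function names, regex patterns, and the `compose` helper are hypothetical stand-ins written for this example; they are not the redactors shipped in this PR, whose signatures may differ.

```python
import re
from typing import Callable, Iterable

Redactor = Callable[[str], str]

# Hypothetical patterns for illustration; not the ones shipped in lib/replay.
_EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
_BEARER = re.compile(r"Bearer\s+[A-Za-z0-9._~+/-]+=*")


def scrub_emails(text: str) -> str:
    """Replace anything that looks like an email address."""
    return _EMAIL.sub("[EMAIL]", text)


def scrub_bearer_tokens(text: str) -> str:
    """Replace the token part of an HTTP bearer credential."""
    return _BEARER.sub("Bearer [TOKEN]", text)


def compose(redactors: Iterable[Redactor]) -> Redactor:
    """Chain redactors left to right; an empty chain leaves text untouched."""
    steps = list(redactors)

    def apply(text: str) -> str:
        for step in steps:
            text = step(text)
        return text

    return apply


scrub = compose([scrub_emails, scrub_bearer_tokens])
print(scrub("mail bob@example.com with Authorization: Bearer abc.def-123"))
```

An empty chain is the identity function, which mirrors the opt-in default: nothing is modified unless the caller asks for it.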
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Bugbot Autofix is ON, but it could not run because the branch was deleted or merged before autofix could start.
Reviewed by Cursor Bugbot for commit 5bd65e0.

Summary
What:

Adds `dedalus_labs.lib.replay`, a new module that ships in two parts:

- `Recorder` captures every model request, model response, and tool result from a `DedalusRunner` run to a local versioned JSON trace file. Also wires up the existing-but-inactive `on_tool_event` runner parameter so it actually fires, and adds a parallel `on_model_event` parameter.
- `Replayer` reads that trace back and deterministically re-runs the agent through the production `DedalusRunner` with zero network traffic. Reuses the runner unchanged - only two seams are intercepted:
  - Model calls: a `_FakeClient` that pops recorded `ChatCompletion` objects in order.
  - Tool calls: synthetic callables that return recorded `tool_end` results by name.

Recording supports opt-in redaction via composable redactor functions. Three are shipped out of the box (`redact_emails`, `redact_bearer_tokens`, `redact_api_keys`) so sensitive data can be scrubbed before a trace file leaves the machine.

Replay has two escape hatches that let an engineer modify the recorded run while keeping everything else identical:

- `swap_tool={name: callable}` substitutes a real Python function for one tool. Useful for A/B-testing a fix against the recorded conversation.
- `swap_client=Dedalus()` substitutes a real client. Useful for re-running the recorded messages and tools against a live model.

Drift detection: if the runner asks for more model responses than recorded, or calls a tool more times than the trace shows, replay raises a `RuntimeError` pointing the user at the right `swap_*` argument.

Why:
FDE feature.
Agent runs are non-deterministic. When a customer reports "the agent did X at 3pm," there is no way to reproduce it today - the workflow is screen-share and guesswork. With record/replay, the customer sends a `trace.json`, the engineer runs `Replayer.from_file("trace.json").run()` locally, and the bug is reproducible in seconds.

This is a Dedalus-tailored capture: the runner routes across 6+ providers and composes multiple MCP servers, so the trace captures which provider answered, which MCP server returned what, and the full tool argument and result history in one file.
Recording is opt-in and local-first. A run without the callbacks behaves exactly as before. Replay is zero-runner-change: it uses the production code path with two injected seams, so policy, message building, parallel scheduler, and MCP composition all run as in production.
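The model-side seam can be pictured as a queue of recorded responses served in order, with an error once the queue runs dry. The class below is a simplified, hypothetical stand-in: the real `_FakeClient` returns `ChatCompletion` objects and presumably mirrors the client's actual call surface, whereas this sketch uses plain dicts and a single `create` method.

```python
from collections import deque
from typing import Any, Deque, Dict, List


class FakeClient:
    """Simplified, hypothetical stand-in for the PR's _FakeClient."""

    def __init__(self, recorded_responses: List[Dict[str, Any]]) -> None:
        self._pending: Deque[Dict[str, Any]] = deque(recorded_responses)

    def create(self, **_request: Any) -> Dict[str, Any]:
        # The request is ignored: replay returns what was recorded, in order.
        if not self._pending:
            raise RuntimeError(
                "replay drift: more model responses requested than recorded; "
                "consider swap_client to continue against a live model"
            )
        return self._pending.popleft()


client = FakeClient([{"content": "calling tool"}, {"content": "final answer"}])
first = client.create(model="any", messages=[])
second = client.create(model="any", messages=[])
print(first["content"], "->", second["content"])
```

Because the fake only replaces the object the runner calls, everything upstream of that call (policy, message building, scheduling) still executes as in a live run.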
Lines added: ~1000 total (impl ~400, tests ~370, docs ~220, examples ~150)
Test Plan
Automated:
- `uv run pytest tests/lib/test_replay_recorder.py tests/lib/test_replay_runner_integration.py tests/lib/test_replay_replayer.py -v` - 20 tests covering event order, redaction composition, metadata roundtrip, context manager safety, redactor failure isolation, save idempotency, model event order, tool_end correlation, callback failure isolation, full recorder round-trip, format-version rejection, missing-events rejection, identity replay (with and without tool calls), swap_tool, swap_client, and drift detection.
- `uv run pytest tests/ --ignore=tests/api_resources` - 537 passed, 1 skipped (0 regressions vs main).

Live record-then-replay (requires API key for the record step, replay needs no network):
The replay step prints the same final answer as the live run. Running it with the key unset proves zero network traffic during replay.
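The automated tests listed above include format-version rejection and missing-events rejection. A minimal sketch of that kind of load-time check follows; the key names `format_version` and `events`, the version value, and the `load_trace` helper are assumptions made for illustration, since the actual trace schema is defined in `docs/replay.md`.

```python
import json
from typing import Any, Dict

SUPPORTED_FORMAT_VERSION = 1  # assumed value, for illustration only


def load_trace(text: str) -> Dict[str, Any]:
    """Parse a trace and reject shapes the replayer could not handle."""
    trace = json.loads(text)
    if not isinstance(trace, dict):
        raise ValueError("trace must be a JSON object")
    version = trace.get("format_version")  # hypothetical key name
    if version != SUPPORTED_FORMAT_VERSION:
        raise ValueError(f"unsupported trace format_version: {version!r}")
    events = trace.get("events")  # hypothetical key name
    if not isinstance(events, list) or not events:
        raise ValueError("trace has no events")
    return trace


good = load_trace('{"format_version": 1, "events": [{"type": "model_request"}]}')
print(len(good["events"]))
```

Failing at load time keeps a mismatched trace from producing a confusing error halfway through a replay.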
Multi-tool, multi-model live demo:
Records a multi-step run across two models. The model emits a `transfer_to_*` handoff tool call that the runner does not currently execute client-side (the server-side handoff path is not wired into this SDK build). The recorder captures this faithfully, and replay reproduces the final state byte-for-byte, including the unresolved handoff.

Repro / Showcase
Three runnable examples in `examples/replay/`:

- `01_record.py` - 30-line tool-calling agent run recorded to `trace.json`.
- `02_replay.py` - 25-line replayer for any trace file.
- `03_multi_tool.py` - records a multi-tool, multi-step, multi-model run.

Can be kept or removed when merging.
Tests Added
Documentation
- `docs/`: `docs/replay.md` - privacy model, public API for Recorder, trace format reference, built-in redactors, Replayer API, swap semantics, drift detection, out-of-scope items.
- `apps/docs/`: N/A

Notes for Reviewers
Runner changes are minimal. `on_tool_event` was already declared on `_ExecutionConfig` and on the `run()` signature in `main` but never emitted - this PR makes the declared hook fire and adds a parallel `on_model_event`. The runner diff is ~25 lines added, 0 changed. All other new logic lives in `lib/replay/`.

Replay has zero runner changes. The Replayer injects a fake client (not an `AsyncDedalus` instance, so the runner picks the sync execution path) and synthetic tool callables built from the recorded `tool_end` events. It walks the exact production loop. `swap_tool` callables are wrapped via a lambda so the runner's `__name__`-based dispatch finds them; the user's original function is not mutated.

Streaming paths (`_execute_streaming_async/sync`) are intentionally not instrumented in this PR - documented in `docs/replay.md` under "Out of scope" along with cloud upload, OTel export, trace diffing, and schema migration tooling.
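For reviewers who want a mental model of the tool-side seam described in the notes above, the sketch below builds synthetic callables from recorded results and wraps a swapped-in function so that name-based dispatch still finds it. The event keys `name` and `result`, and the `build_replay_tools` helper, are hypothetical; the actual implementation in `lib/replay/` may be organized differently.

```python
from collections import defaultdict, deque
from typing import Any, Callable, Dict, List, Optional


def build_replay_tools(
    tool_end_events: List[Dict[str, Any]],
    swap_tool: Optional[Dict[str, Callable[..., Any]]] = None,
) -> Dict[str, Callable[..., Any]]:
    """Return one callable per tool name, backed by recorded results."""
    recorded: Dict[str, deque] = defaultdict(deque)
    for event in tool_end_events:
        recorded[event["name"]].append(event["result"])  # assumed keys

    tools: Dict[str, Callable[..., Any]] = {}
    for name, queue in recorded.items():
        # Defaults bind the current queue/name, avoiding late-binding bugs.
        def synthetic(*_args: Any, _queue=queue, _name=name, **_kwargs: Any) -> Any:
            if not _queue:
                raise RuntimeError(
                    f"replay drift: tool {_name!r} called more times than "
                    "recorded; consider swap_tool"
                )
            return _queue.popleft()

        synthetic.__name__ = name
        tools[name] = synthetic

    for name, fn in (swap_tool or {}).items():
        # Wrap in a lambda so __name__ is set on the wrapper, not on fn.
        wrapper = lambda *a, _fn=fn, **kw: _fn(*a, **kw)
        wrapper.__name__ = name
        tools[name] = wrapper
    return tools


def fixed_add(a: int, b: int) -> int:
    return a + b


events = [{"name": "add", "result": 3}, {"name": "add", "result": 7}]
replayed = build_replay_tools(events)
swapped = build_replay_tools(events, swap_tool={"add": fixed_add})
print(replayed["add"](1, 2), replayed["add"](3, 4), swapped["add"](10, 5))
```

Setting `__name__` on the wrapper rather than on the user's function is what keeps the original callable unmodified after the replay finishes.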