feat(replay): add Recorder for capturing agent runs to JSON by islamborghini · Pull Request #57 · dedalus-labs/dedalus-agents-python

Islam Assanov (islamborghini) · 2026-05-13T10:19:16Z

Summary

What:

Adds dedalus_labs.lib.replay, a new module that ships in two parts:

Recorder captures every model request, model response, and tool result from a DedalusRunner run to a local versioned JSON trace file. Also wires up the existing-but-inactive on_tool_event runner parameter so it actually fires, and adds a parallel on_model_event parameter.
Replayer reads that trace back and deterministically re-runs the agent through the production DedalusRunner with zero network traffic. Reuses the runner unchanged - only two seams are intercepted:
- A _FakeClient that pops recorded ChatCompletion objects in order.
- Synthetic tool callables that pop recorded tool_end results by name.

Recording supports opt-in redaction via composable redactor functions. Three are shipped out of the box (redact_emails, redact_bearer_tokens, redact_api_keys) so sensitive data can be scrubbed before a trace file leaves the machine.

Replay has two escape hatches that let an engineer modify the recorded run while keeping everything else identical:

swap_tool={name: callable} substitutes a real Python function for one tool. Useful for A/B-testing a fix against the recorded conversation.
swap_client=Dedalus() substitutes a real client. Useful for re-running the recorded messages and tools against a live model.

Drift detection: if the runner asks for more model responses than recorded, or calls a tool more times than the trace shows, replay raises a RuntimeError pointing the user at the right swap_* argument.

Why:

FDE feature.
Agent runs are non-deterministic. When a customer reports "the agent did X at 3pm," there is no way to reproduce it today - the workflow is screen-share and guesswork. With record/replay, the customer sends a trace.json, the engineer runs Replayer.from_file("trace.json").run() locally, and the bug is reproducible in seconds.
This is a Dedalus-tailored capture: the runner routes across 6+ providers and composes multiple MCP servers, so the trace captures which provider answered, which MCP server returned what, and the full tool argument and result history in one file.

Recording is opt-in and local-first. A run without the callbacks behaves exactly as before. Replay is zero-runner-change: it uses the production code path with two injected seams, so policy, message building, parallel scheduler, and MCP composition all run as in production.

Lines added: ~1000 total (impl ~400, tests ~370, docs ~220, examples ~150)

Test Plan

Automated:

uv run pytest tests/lib/test_replay_recorder.py tests/lib/test_replay_runner_integration.py tests/lib/test_replay_replayer.py -v - 20 tests covering event order, redaction composition, metadata roundtrip, context manager safety, redactor failure isolation, save idempotency, model event order, tool_end correlation, callback failure isolation, full recorder round-trip, format-version rejection, missing-events rejection, identity replay (with and without tool calls), swap_tool, swap_client, and drift detection.
uv run pytest tests/ --ignore=tests/api_resources - 537 passed, 1 skipped (0 regressions vs main).

Live record-then-replay (requires API key for the record step, replay needs no network):

DEDALUS_API_KEY=<key> uv run python examples/replay/01_record.py
unset DEDALUS_API_KEY
uv run python examples/replay/02_replay.py trace.json

The replay step prints the same final answer as the live run. Running it with the key unset proves zero network traffic during replay.

Multi-tool, multi-model live demo:

DEDALUS_API_KEY=<key> uv run python examples/replay/03_multi_tool.py
uv run python examples/replay/02_replay.py trace_multi.json

Records a multi-step run across two models. The model emits a transfer_to_* handoff tool call that the runner does not currently execute client-side (the server-side handoff path is not wired into this SDK build). The recorder captures this faithfully, and replay reproduces the final state byte-for-byte, including the unresolved handoff.

Repro / Showcase

Three runnable examples in examples/replay/:

01_record.py - 30-line tool-calling agent run recorded to trace.json.
02_replay.py - 25-line replayer for any trace file.
03_multi_tool.py - records a multi-tool, multi-step, multi-model run.

Can be kept or removed when merging.

Tests Added

Unit tests
Integration tests
E2E tests
N/A (no new code paths)

Documentation

Internal (docs/): docs/replay.md - privacy model, public API for Recorder, trace format reference, built-in redactors, Replayer API, swap semantics, drift detection, out-of-scope items.
External (apps/docs/): N/A

Notes for Reviewers

Runner changes are minimal. on_tool_event was already declared on _ExecutionConfig and on the run() signature in main but never emitted - this PR makes the declared hook fire and adds a parallel on_model_event. The runner diff is ~25 lines added, 0 changed. All other new logic lives in lib/replay/.

Replay has zero runner changes. The Replayer injects a fake client (not an AsyncDedalus instance, so the runner picks the sync execution path) and synthetic tool callables built from the recorded tool_end events. It walks the exact production loop. swap_tool callables are wrapped via a lambda so the runner's __name__-based dispatch finds them; the user's original function is not mutated.

Streaming paths (_execute_streaming_async/sync) are intentionally not instrumented in this PR - documented in docs/replay.md under "Out of scope" along with cloud upload, OTel export, trace diffing, and schema migration tooling.

cursor · 2026-05-13T10:19:23Z

PR Summary

Medium Risk
Adds new observation callbacks and event emission in DedalusRunner, plus a new trace/replay subsystem; incorrect event payloads or tool-call correlation could affect debugging and (if misused) leak data, though credentials are explicitly omitted and callbacks are best-effort.

Overview
Introduces dedalus_labs.lib.replay to record agent runs to a versioned local JSON trace (Recorder, redactors, and trace envelope) and replay them deterministically via Replayer using a fake client and recorded tool results, with explicit drift detection and optional swap_tool/swap_client overrides.

Updates DedalusRunner to actually emit observation events: adds on_model_event and wires on_tool_event to fire tool_end events correlated by tool_call_id (including under concurrent tool execution), and emits model_request/model_response events with JSON-serializable payloads while dropping credentials from request events.

Adds docs and runnable examples for record/replay, plus unit + integration tests covering trace shape, redaction behavior, callback error isolation, correlation correctness, and replay drift/override behavior.

^{Reviewed by Cursor Bugbot for commit 0bbaa1c. Bugbot is set up for automated code reviews on this repo. Configure here.}

Islam Assanov (islamborghini) · 2026-05-13T10:31:44Z

Redaction is opt-in by design. Silently modifying trace data by default would make the tool misleading for debugging (an email in a tool argument is legitimate data, not always sensitive). Users who need to scrub data before sharing a trace have three built-in redactors and a composable redact= hook; the privacy section of docs/replay.md covers this explicitly.

…ects

…ntent

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

^{Bugbot Autofix is ON, but it could not run because the branch was deleted or merged before autofix could start.}

^{Reviewed by Cursor Bugbot for commit 5bd65e0. Configure here.}

Islam Assanov (islamborghini) added 4 commits May 13, 2026 11:25

feat(replay): add Recorder for capturing runner events to JSON

93fb5a6

feat(runner): emit model and tool events for the replay recorder

efc86ff

docs(replay): add replay.md reference and 01_record.py example

95b982e

style(replay): fix import sort order

8f175ec

Islam Assanov (islamborghini) added 5 commits May 13, 2026 18:14

feat(replay): add _FakeClient that serves recorded ChatCompletion obj…

1786e56

…ects

feat(replay): add Replayer class for deterministic trace replay

1487ab9

docs(replay): document Replayer and add 02_replay.py example

2ecc778

examples(replay): add multi-tool multi-step demo

80f9eff

style(replay): drop redundant comments and section dividers

c4807f6

cursor Bot reviewed May 14, 2026

View reviewed changes

Comment thread src/dedalus_labs/lib/runner/core.py Outdated

fix(replay): scrub credentials from model_request event payload

c64da6d

cursor Bot reviewed May 14, 2026

View reviewed changes

Comment thread src/dedalus_labs/lib/replay/_replayer.py Outdated

fix(replay): preserve full recorded messages instead of only first co…

5bd65e0

…ntent

cursor Bot reviewed May 14, 2026

View reviewed changes

Comment thread src/dedalus_labs/lib/runner/core.py

fix(replay): correlate tool_end events by tool_call_id under concurrency

0bbaa1c

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(replay): add Recorder for capturing agent runs to JSON#57

feat(replay): add Recorder for capturing agent runs to JSON#57
Islam Assanov (islamborghini) wants to merge 12 commits into
dedalus-labs:mainfrom
islamborghini:feat/replay-recorder

Islam Assanov (islamborghini) commented May 13, 2026 •

edited

Loading

Uh oh!

cursor Bot commented May 13, 2026 •

edited

Loading

Uh oh!

Islam Assanov (islamborghini) commented May 13, 2026

Uh oh!

Uh oh!

Uh oh!

cursor Bot left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

Islam Assanov (islamborghini) commented May 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test Plan

Repro / Showcase

Tests Added

Documentation

Notes for Reviewers

Uh oh!

cursor Bot commented May 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Summary

Uh oh!

Islam Assanov (islamborghini) commented May 13, 2026

Uh oh!

Uh oh!

Uh oh!

cursor Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Islam Assanov (islamborghini) commented May 13, 2026 •

edited

Loading

cursor Bot commented May 13, 2026 •

edited

Loading