feat(retention): export/clean/rehydrate endpoints for task content #243

Open
smoreinis wants to merge 2 commits into main from task-retention-endpoints
Conversation

@smoreinis
Collaborator

@smoreinis smoreinis commented May 19, 2026

Summary

Adds an operational surface for bounded retention of task chat content in shared infrastructure. Callers can snapshot a task's content, delete it from the shared stores, and later restore it byte-identically — preserving message IDs and timestamps so tool-call and reasoning references between messages stay valid.

New endpoints

| Method | Path | Behavior |
| --- | --- | --- |
| GET | /tasks/{task_id}/export | Returns a self-contained snapshot (messages + task_states). Same shape that rehydrate accepts; schema parity is the load-bearing invariant for byte-identical round-trip. |
| POST | /tasks/{task_id}/clean | Deletes content across Mongo messages, Mongo task_states, Postgres events; resets agent_task_tracker.last_processed_event_id; sets tasks.cleaned_at. |
| POST | /tasks/{task_id}/rehydrate | Restores content from a snapshot with caller-supplied IDs preserved; clears cleaned_at. |

Design notes

  • Domain logic in TaskRetentionService — both the HTTP routes and the (future) scheduled Temporal sweep workflow will call the same service methods, so the cleanup path is exercised by the same code in both contexts.
  • Cleanup operation order — Mongo deletes (idempotent by task_id) first, then Postgres operations, then tasks.cleaned_at last. A retry after partial failure converges because each step is idempotent and cleaned_at is the gate that keeps subsequent runs from re-doing work.
  • Guards on clean:
    • Refuses tasks with status == RUNNING (regardless of force=true).
    • Refuses tasks with unprocessed events past their cursor.
    • Refuses tasks not idle for idle_days (default 7) unless force=true.
    • Already-cleaned tasks return an empty result rather than raising.
  • Optimistic concurrency for the unprocessed-events check — no row locks; an event arriving in the narrow race window between check and delete will be deleted with the rest. Acceptable for v1; surfaced in audit logs (events_deleted > 0 on an idle-checked task is a signal).
  • No audit table for v1 — cleanup operations emit structured log lines (task_cleanup_completed, task_rehydrate_completed) with the result payload. Datadog log search is the forensic trail.
  • ID preservation is a caller contract — the caller (external integrator) must capture Agentex-generated IDs at write time and supply them at rehydrate time. Agentex does not enforce this; the round-trip invariant depends on it.
  • tasks.params is out of scope for v1 — not exported, not stripped during cleanup, not restored. If it turns out to carry chat content for specific agents, follow up.
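
The guard checks and the "Mongo first, marker last" ordering described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual TaskRetentionService code: the `Task` and `Stores` classes are in-memory stand-ins, `last_activity_at`, the `COMPLETED` status value, and the result keys are hypothetical names, and events are modeled as increasing integer IDs compared against a single cursor.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Optional


class ClientError(Exception):
    """Stand-in for the 400-level error returned when cleanup is refused."""


@dataclass
class Task:
    id: str
    status: str
    last_activity_at: datetime            # hypothetical field used for the idle check
    cleaned_at: Optional[datetime] = None


@dataclass
class Stores:
    """In-memory stand-ins for the Mongo and Postgres stores (hypothetical)."""
    messages: dict = field(default_factory=dict)     # Mongo: task_id -> messages
    task_states: dict = field(default_factory=dict)  # Mongo: task_id -> task states
    events: dict = field(default_factory=dict)       # Postgres: task_id -> event ids
    cursors: dict = field(default_factory=dict)      # Postgres: task_id -> last processed id


EMPTY_RESULT = {"messages_deleted": 0, "task_states_deleted": 0,
                "events_deleted": 0, "cursors_reset": 0}


def clean_task(task: Task, stores: Stores, now: datetime,
               idle_days: int = 7, force: bool = False) -> dict:
    # Gate: an already-cleaned task is a no-op rather than an error.
    if task.cleaned_at is not None:
        return dict(EMPTY_RESULT)
    # RUNNING is refused even when force=True.
    if task.status == "RUNNING":
        raise ClientError("task is RUNNING")
    # In this sketch, force bypasses only the idle threshold.
    if not force and now - task.last_activity_at < timedelta(days=idle_days):
        raise ClientError(f"task has not been idle for {idle_days} days")
    # Unprocessed events: any event id beyond the tracker cursor.
    cursor = stores.cursors.get(task.id)
    events = stores.events.get(task.id, [])
    if any(cursor is None or event_id > cursor for event_id in events):
        raise ClientError("task has unprocessed events")

    # Deletes keyed by task_id are idempotent, so a retry after a partial
    # failure simply finds less to delete.
    result = {
        "messages_deleted": len(stores.messages.pop(task.id, [])),
        "task_states_deleted": len(stores.task_states.pop(task.id, [])),
        "events_deleted": len(stores.events.pop(task.id, [])),
        "cursors_reset": 0,
    }
    if stores.cursors.get(task.id) is not None:
        stores.cursors[task.id] = None
        result["cursors_reset"] = 1
    # The marker is written last: until it is set, a retry redoes the deletes.
    task.cleaned_at = now
    return result
```

Because the marker is the last write, a crash between any two steps leaves `cleaned_at` unset, and the next call repeats the (idempotent) deletes before setting it.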

Schema change

A single nullable column on tasks:

ALTER TABLE tasks ADD COLUMN cleaned_at TIMESTAMPTZ NULL;

This is a metadata-only ALTER (Postgres ≥11 doesn't rewrite the table). Falls within the project's safe-migration shape — passes the migration safety linter.

Other changes

  • adapter_mongodb.batch_create now translates pymongo BulkWriteError containing only duplicate-key sub-errors (code 11000) into DuplicateItemError (HTTP 400). Previously it fell through to the generic Exception handler and surfaced as HTTP 500. Narrowly scoped — non-duplicate bulk-write errors still surface as ServiceError.
  • Two new repository methods:
    • EventRepository.delete_by_task_id(task_id) → int
    • AgentTaskTrackerRepository.reset_cursors_for_task(task_id) → int
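
For illustration, the duplicate-key classification could look something like the sketch below. It works on the `details` dictionary that pymongo attaches to a `BulkWriteError` (a `writeErrors` list whose entries carry a numeric `code`), so it runs without pymongo installed. The helper names and the exception classes here are stand-ins defined for the example, not the adapter's actual code, and treating write-concern errors as non-duplicate is an assumption.

```python
DUPLICATE_KEY_CODE = 11000  # MongoDB's duplicate-key error code


class DuplicateItemError(Exception):
    """Stand-in for the project's error that maps to HTTP 400."""
    status_code = 400


class ServiceError(Exception):
    """Stand-in for the project's generic error that maps to HTTP 500."""
    status_code = 500


def is_all_duplicate_key(details: dict) -> bool:
    """True only when every write error in the bulk result is a duplicate key."""
    write_errors = details.get("writeErrors") or []
    if not write_errors:
        return False
    # Assumption: any write-concern error makes the failure non-duplicate.
    if details.get("writeConcernErrors"):
        return False
    return all(err.get("code") == DUPLICATE_KEY_CODE for err in write_errors)


def translate_bulk_write_error(details: dict) -> Exception:
    """Pick the exception to raise for a failed bulk insert."""
    if is_all_duplicate_key(details):
        return DuplicateItemError("one or more items already exist")
    return ServiceError("bulk write failed")
```

In an adapter, this would typically sit inside an `except BulkWriteError as exc:` block, called as `raise translate_bulk_write_error(exc.details) from exc`.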

Tests

13 integration tests in tests/integration/api/task_retention/test_task_retention_api.py covering:

  • Export: happy path, empty task, nonexistent task (404)
  • Clean: success across all surfaces, cursor reset, RUNNING refused, already-cleaned no-op, unprocessed events refused, nonexistent task (404)
  • Rehydrate: byte-identical round-trip, active task refused, task_id mismatch refused, ID collision refused

Suite runs in ~24s via testcontainers and passes.

Test plan

  • Run make test FILE=tests/integration/api/task_retention/
  • Sanity-check the migration applies cleanly on a non-empty tasks table (metadata-only, should be instant)
  • Manual round-trip via curl against a non-RUNNING task: export → save snapshot → clean → rehydrate with the snapshot → re-export → diff
  • Confirm cleaned_at surfaces in GET /tasks/{id} responses
  • Confirm the existing tasks API and tests still pass (no behavior change for active tasks)
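
The final "diff" step of the manual round-trip could be scripted along these lines. This is a rough sketch that assumes the snapshot is a JSON object with `messages` and `task_states` lists whose entries each carry an `id` field; the function names are hypothetical. It compares entities field by field through canonical JSON, so it checks content equality rather than the raw bytes of the HTTP responses.

```python
import json


def canonical(value) -> str:
    """Serialize with sorted keys so key order cannot cause false diffs."""
    return json.dumps(value, sort_keys=True, separators=(",", ":"), ensure_ascii=False)


def snapshot_diff(before: dict, after: dict) -> list:
    """Return human-readable differences between two exported snapshots."""
    problems = []
    for key in ("messages", "task_states"):
        old = {item["id"]: item for item in before.get(key, [])}
        new = {item["id"]: item for item in after.get(key, [])}
        for missing in sorted(old.keys() - new.keys()):
            problems.append(f"{key}: {missing} missing after rehydrate")
        for extra in sorted(new.keys() - old.keys()):
            problems.append(f"{key}: {extra} unexpected after rehydrate")
        for shared in sorted(old.keys() & new.keys()):
            if canonical(old[shared]) != canonical(new[shared]):
                problems.append(f"{key}: {shared} changed")
    return problems
```

An empty list means the re-exported snapshot matches the saved one, including IDs and timestamps.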

Follow-ups (not in this PR)

  • Scheduled Temporal cleanup workflow that calls clean_task on a daily sweep
  • Auth gating — clean is destructive and should require elevated privilege beyond task ownership
  • Decision on tasks.params content stripping if it proves to carry chat content for any agent

Greptile Summary

This PR introduces a bounded-retention operational surface for task chat content: a GET /export endpoint that snapshots messages and task states, a POST /clean endpoint that deletes them across Mongo and Postgres, and a POST /rehydrate endpoint that restores a snapshot byte-for-byte using caller-preserved IDs. A nullable cleaned_at column is added to the tasks table and is the idempotency gate for the entire flow.

  • All three new endpoints are wired with DAuthorizedId for resource-level auth (read/delete/update respectively), and rehydrate_task validates per-entity task_id fields before touching either store — both previously-flagged gaps are closed.
  • adapter_mongodb.batch_create now correctly translates all-duplicate-key BulkWriteError into DuplicateItemError (HTTP 400) instead of falling through to HTTP 500.
  • Operation ordering in clean_task (Mongo deletes → Postgres deletes → cleaned_at last) is carefully chosen so retries after partial failure converge correctly; this is explicitly documented in the docstring.

Confidence Score: 5/5

Safe to merge. The destructive clean endpoint is guarded by status, idle-threshold, and unprocessed-events checks; auth is wired on all three endpoints; and the round-trip invariant is validated by integration tests.

The PR closes all previously-identified gaps (auth via DAuthorizedId, per-entity task_id validation in rehydrate). The cross-database operation ordering is correct and idempotent. The only remaining observation is that idle_days is caller-tunable below the 7-day default, which is a policy question flagged as non-blocking.

No files require special attention beyond the idle_days policy note in the schema.

Important Files Changed

| Filename | Overview |
| --- | --- |
| agentex/src/domain/services/task_retention_service.py | Core retention logic: clean, export, and rehydrate with guards for RUNNING status, idle threshold, and unprocessed events. Operation ordering is carefully documented and retry-safe. idle_days is caller-tunable without admin restriction. |
| agentex/src/api/routes/task_retention.py | Three new endpoints with DAuthorizedId applied (read/delete/update); addresses previous auth review comment. Thin wrapper over use case layer. |
| agentex/src/adapters/crud_store/adapter_mongodb.py | Adds BulkWriteError handler: translates all-duplicate-key bulk errors into DuplicateItemError (400); mixed or non-duplicate errors surface as ServiceError. Logic is correct. |
| agentex/src/api/schemas/task_retention.py | Thin schema wrappers over domain entities; ExportTaskResponse and RehydrateTaskRequest share the TaskSnapshotEntity shape intentionally to enforce round-trip parity. |
| agentex/tests/integration/api/task_retention/test_task_retention_api.py | 13 integration tests covering round-trip parity, all precondition guards (RUNNING, idle, unprocessed events, cleaned state, task_id mismatch, duplicate IDs), and cross-store verification. Good coverage. |
| agentex/database/migrations/alembic/versions/2026_05_19_1929_adding_task_cleaned_at_6c942325c828.py | Metadata-only nullable column addition on tasks table; safe for Postgres ≥11 with no table rewrite. Downgrade path included. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["POST /tasks/id/clean"] --> B{"cleaned_at != NULL?"}
    B -- yes --> C["Return empty result / no-op"]
    B -- no --> D{"status == RUNNING?"}
    D -- yes --> E["400 ClientError"]
    D -- no --> F{"enforce_idle AND not idle?"}
    F -- yes --> G["400 ClientError"]
    F -- no --> H{"unprocessed events?"}
    H -- yes --> I["400 ClientError"]
    H -- no --> J["Mongo: delete messages"]
    J --> K["Mongo: delete task_states"]
    K --> L["Postgres: delete events"]
    L --> M["Postgres: reset tracker cursors"]
    M --> N["Postgres: set cleaned_at = now"]
    N --> O["Return TaskCleanupResultEntity"]

    P["POST /tasks/id/rehydrate"] --> R{"task_id mismatch?"}
    R -- yes --> S["400 ClientError"]
    R -- no --> T{"entity task_id mismatch?"}
    T -- yes --> U["400 ClientError"]
    T -- no --> V{"cleaned_at == NULL?"}
    V -- yes --> W["400 ClientError"]
    V -- no --> X["Mongo: batch insert messages"]
    X --> Y["Mongo: batch insert task_states"]
    Y --> Z["Postgres: set cleaned_at = NULL"]

Prompt To Fix All With AI
Fix the following 1 code review issue. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 1
agentex/src/api/schemas/task_retention.py:23-27
The `idle_days` field lets any task owner shorten the idle window to as little as 1 day without using `force=True`. Since `force` carries an "Admin use only" note, the intent is that aggressive cleanup requires elevated privilege — but a caller can achieve nearly the same effect by supplying `idle_days=1` without `force`. Consider capping `idle_days` to a maximum, or documenting that `idle_days` is also an admin-only parameter and enforcing it server-side once proper admin auth lands.

```suggestion
    idle_days: int = Field(
        default=7,
        ge=1,
        le=365,
        description=(
            "Idle threshold in days (ignored when force=true). "
            "Values below the default 7 days have the same practical effect "
            "as reducing the retention window and should be treated as admin-only."
        ),
    )
```

Reviews (2): Last reviewed commit: "address review: per-task authz on retent..."

Adds an operational surface for bounded retention of task chat content
in shared infrastructure. Callers can snapshot a task's content,
delete it from the shared stores, and later restore it byte-identically
from the snapshot — preserving message IDs and timestamps so tool-call
and reasoning references remain valid.

Three new endpoints under /tasks/{task_id}:
- GET /export — returns a self-contained snapshot (messages + task_states)
- POST /clean — deletes content across Mongo messages, Mongo task_states,
  Postgres events; resets agent_task_tracker cursors; sets tasks.cleaned_at
- POST /rehydrate — restores content from a snapshot, clears cleaned_at

Domain layer lives in TaskRetentionService so the eventual scheduled
sweep workflow and the HTTP endpoints share the same code path.

Cleanup uses a "Mongo deletes first, Postgres marker last" order so
retries after partial failure converge correctly. The active-task,
idle-threshold, and unprocessed-events guards refuse cleanup when the
task isn't safe to drop.

Schema:
- New nullable tasks.cleaned_at column (TIMESTAMPTZ, metadata-only ALTER)
- No new audit table — cleanup operations emit structured log lines

Other changes:
- adapter_mongodb.batch_create now translates pymongo BulkWriteError
  with all-duplicate-key sub-errors into DuplicateItemError (HTTP 400)
  instead of letting it surface as ServiceError (HTTP 500)
- New EventRepository.delete_by_task_id and
  AgentTaskTrackerRepository.reset_cursors_for_task methods

Tests: 13 integration tests covering happy paths, all precondition
guards, and the byte-identical export → clean → rehydrate round-trip.
@smoreinis smoreinis requested a review from a team as a code owner May 19, 2026 20:12
@github-actions

github-actions Bot commented May 19, 2026

✱ Stainless preview builds

This PR will update the agentex-sdk SDKs with the following commit messages.

openapi

feat(api): add export/clean/rehydrate to tasks, cleaned_at field, entity types

python

feat(api): add cleaned_at field to task response types

typescript

feat(api): add cleaned_at field to tasks responses

Edit this comment to update them. They will appear in their respective SDK's changelogs.

agentex-sdk-openapi studio · code · diff

Your SDK build had at least one new note diagnostic, which is a regression from the base state.
generate ✅

New diagnostics (3 note)
💡 Endpoint/NotConfigured: Skipped endpoint because it's not in your Stainless config: `get /tasks/{task_id}/export`
💡 Endpoint/NotConfigured: Skipped endpoint because it's not in your Stainless config: `post /tasks/{task_id}/clean`
💡 Endpoint/NotConfigured: Skipped endpoint because it's not in your Stainless config: `post /tasks/{task_id}/rehydrate`
agentex-sdk-typescript studio · code · diff

Your SDK build had at least one new note diagnostic, which is a regression from the base state.
generate ⚠️ → build ⏭️ (prev: build ✅) → lint ⏭️ (prev: lint ✅) → test ⏭️ (prev: test ✅)

New diagnostics (3 note)
💡 Endpoint/NotConfigured: Skipped endpoint because it's not in your Stainless config: `get /tasks/{task_id}/export`
💡 Endpoint/NotConfigured: Skipped endpoint because it's not in your Stainless config: `post /tasks/{task_id}/clean`
💡 Endpoint/NotConfigured: Skipped endpoint because it's not in your Stainless config: `post /tasks/{task_id}/rehydrate`
agentex-sdk-python studio · code · diff

Your SDK build had at least one new note diagnostic, which is a regression from the base state.
generate ⚠️ → build ⏭️ (prev: build ✅) → lint ⏭️ (prev: lint ✅) → test ⏭️ (prev: test ✅)

New diagnostics (3 note)
💡 Endpoint/NotConfigured: Skipped endpoint because it's not in your Stainless config: `get /tasks/{task_id}/export`
💡 Endpoint/NotConfigured: Skipped endpoint because it's not in your Stainless config: `post /tasks/{task_id}/clean`
💡 Endpoint/NotConfigured: Skipped endpoint because it's not in your Stainless config: `post /tasks/{task_id}/rehydrate`

This comment is auto-generated by GitHub Actions and is automatically kept up to date as you push.
If you push custom code to the preview branch, re-run this workflow to update the comment.
Last updated: 2026-05-19 20:32:33 UTC

Comment thread agentex/src/api/routes/task_retention.py
Comment thread agentex/src/domain/services/task_retention_service.py
… task_ids

Two P1 issues from review.

**Authorization (security)**

The three retention endpoints were inheriting only the global auth
middleware, not the resource-level authorization that every other
/tasks/{task_id}/* route enforces. Any authenticated principal could
export, clean, or rehydrate a task they don't own.

Adds DAuthorizedId to all three handlers matching the existing pattern:
- export → AuthorizedOperationType.read
- clean  → AuthorizedOperationType.delete
- rehydrate → AuthorizedOperationType.update

**Per-entity task_id validation**

snapshot.task_id was checked against the path task_id, but each embedded
TaskMessageEntity and StateEntity carries its own task_id field that
batch_create forwards straight to MongoDB. A caller could pass
snapshot.task_id = "A" with messages whose task_id = "B" and pollute
task B's collection — Mongo has no FK to reject it.

Adds explicit per-item validation in rehydrate_task before any insert.
Returns 400 with the offending index in the message so the caller can
find the bad entry.

Tests: 2 new integration tests covering the mismatched-task_id cases
for both messages and task_states. Full suite (15 tests) still passes.
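
A per-item check of the kind described in this commit might look roughly like the following. It is a sketch only: the snapshot is modeled as a plain dictionary rather than TaskSnapshotEntity, `ClientError` is a local stand-in for the 400 error, and the function name and message wording are hypothetical.

```python
class ClientError(Exception):
    """Stand-in for the error that surfaces as HTTP 400."""


def validate_snapshot_task_ids(path_task_id: str, snapshot: dict) -> None:
    """Reject a snapshot before any insert if any task_id disagrees with the path."""
    if snapshot.get("task_id") != path_task_id:
        raise ClientError(
            f"snapshot.task_id {snapshot.get('task_id')!r} does not match {path_task_id!r}"
        )
    # Each embedded entity carries its own task_id, and Mongo has no foreign
    # key to reject a wrong one, so every item is checked explicitly.
    for key in ("messages", "task_states"):
        for index, item in enumerate(snapshot.get(key, [])):
            if item.get("task_id") != path_task_id:
                raise ClientError(
                    f"{key}[{index}].task_id {item.get('task_id')!r} "
                    f"does not match {path_task_id!r}"
                )
```

Running the whole check before the first insert keeps a bad snapshot from leaving either store partially written.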