feat(retention): export/clean/rehydrate endpoints for task content #243
smoreinis wants to merge 2 commits into
Adds an operational surface for bounded retention of task chat content
in shared infrastructure. Callers can snapshot a task's content,
delete it from the shared stores, and later restore it byte-identically
from the snapshot — preserving message IDs and timestamps so tool-call
and reasoning references remain valid.
Three new endpoints under /tasks/{task_id}:
- GET /export — returns a self-contained snapshot (messages + task_states)
- POST /clean — deletes content across Mongo messages, Mongo task_states,
Postgres events; resets agent_task_tracker cursors; sets tasks.cleaned_at
- POST /rehydrate — restores content from a snapshot, clears cleaned_at
Domain layer lives in TaskRetentionService so the eventual scheduled
sweep workflow and the HTTP endpoints share the same code path.
Cleanup uses a "Mongo deletes first, Postgres marker last" order so
retries after partial failure converge correctly. The active-task,
idle-threshold, and unprocessed-events guards refuse cleanup when the
task isn't safe to drop.
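The guard and ordering logic described above can be sketched roughly as follows. This is a minimal in-memory illustration, not the actual TaskRetentionService code: FakeTask, ClientError, the result keys, the COMPLETED status value, the use of updated_at as the idle clock, and the "event id greater than cursor" definition of unprocessed are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import List, Optional


class ClientError(Exception):
    """Hypothetical stand-in for the error type that maps to HTTP 400."""


@dataclass
class FakeTask:
    """In-memory stand-in for one task's rows across Mongo and Postgres."""
    status: str
    updated_at: datetime                                      # assumed idle clock
    cleaned_at: Optional[datetime] = None
    messages: List[dict] = field(default_factory=list)       # Mongo messages
    task_states: List[dict] = field(default_factory=list)    # Mongo task_states
    event_ids: List[int] = field(default_factory=list)       # Postgres events
    last_processed_event_id: Optional[int] = None            # tracker cursor


def clean_task(task: FakeTask, now: datetime, idle_days: int = 7, force: bool = False) -> dict:
    # Gate: an already-cleaned task is a no-op, so a retry converges.
    if task.cleaned_at is not None:
        return {"messages_deleted": 0, "task_states_deleted": 0, "events_deleted": 0}
    # Active-task guard: applies even when force=True.
    if task.status == "RUNNING":
        raise ClientError("task is RUNNING")
    # Idle-threshold guard: skipped when force=True.
    if not force and now - task.updated_at < timedelta(days=idle_days):
        raise ClientError(f"task has not been idle for {idle_days} days")
    # Unprocessed-events guard (assumed: any event past the tracker cursor).
    cursor = task.last_processed_event_id
    if any(cursor is None or event_id > cursor for event_id in task.event_ids):
        raise ClientError("task has unprocessed events")

    # Mongo deletes first.
    messages_deleted = len(task.messages)
    task.messages.clear()
    task_states_deleted = len(task.task_states)
    task.task_states.clear()
    # Then Postgres content and cursors.
    events_deleted = len(task.event_ids)
    task.event_ids.clear()
    task.last_processed_event_id = None
    # Marker last: only set once everything above has succeeded.
    task.cleaned_at = now
    return {
        "messages_deleted": messages_deleted,
        "task_states_deleted": task_states_deleted,
        "events_deleted": events_deleted,
    }


now = datetime(2024, 1, 20, tzinfo=timezone.utc)
task = FakeTask(
    status="COMPLETED",
    updated_at=now - timedelta(days=10),
    messages=[{"id": "m1"}],
    event_ids=[1, 2],
    last_processed_event_id=2,
)
first = clean_task(task, now)
second = clean_task(task, now)  # retry after success is a no-op
```

Because cleaned_at is written last, a crash between any two steps leaves the marker unset and the next call simply repeats the (idempotent) deletes.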
Schema:
- New nullable tasks.cleaned_at column (TIMESTAMPTZ, metadata-only ALTER)
- No new audit table — cleanup operations emit structured log lines
Other changes:
- adapter_mongodb.batch_create now translates pymongo BulkWriteError
with all-duplicate-key sub-errors into DuplicateItemError (HTTP 400)
instead of letting it surface as ServiceError (HTTP 500)
- New EventRepository.delete_by_task_id and
AgentTaskTrackerRepository.reset_cursors_for_task methods
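One way to picture the BulkWriteError translation is as a classifier over the error's details payload. The sketch below assumes the shape pymongo exposes on BulkWriteError.details (a writeErrors list whose entries carry a numeric code) and takes that dict directly so it runs without pymongo installed. The two exception classes are local stand-ins and translate_bulk_write_details is a hypothetical helper name, not the adapter's real function.

```python
DUPLICATE_KEY_CODE = 11000  # MongoDB duplicate-key error code


class DuplicateItemError(Exception):
    """Local stand-in for the error that maps to HTTP 400."""


class ServiceError(Exception):
    """Local stand-in for the error that maps to HTTP 500."""


def translate_bulk_write_details(details: dict) -> Exception:
    """Return the exception a bulk-write failure should be surfaced as."""
    write_errors = details.get("writeErrors", [])
    # Only translate when there is at least one sub-error and every one
    # of them is a duplicate-key error; anything else stays a ServiceError.
    if write_errors and all(e.get("code") == DUPLICATE_KEY_CODE for e in write_errors):
        indexes = [e.get("index") for e in write_errors]
        return DuplicateItemError(f"duplicate items at batch indexes {indexes}")
    return ServiceError("bulk write failed")


dup_only = {"writeErrors": [{"index": 0, "code": 11000, "errmsg": "E11000 duplicate key"}]}
mixed = {"writeErrors": [{"index": 0, "code": 11000}, {"index": 1, "code": 121}]}

dup_result = translate_bulk_write_details(dup_only)
mixed_result = translate_bulk_write_details(mixed)
empty_result = translate_bulk_write_details({})
```

Requiring every sub-error to be a duplicate keeps the translation narrow: a batch that mixes duplicates with, say, a validation failure still surfaces as a server-side error.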
Tests: 13 integration tests covering happy paths, all precondition
guards, and the byte-identical export → clean → rehydrate round-trip.
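For readers who want a feel for the round-trip check, here is a small stand-alone sketch. It is not taken from the test suite: the in-memory InMemoryStores class, the function names, and the canonical-JSON comparison are assumptions chosen to illustrate the idea of comparing an export taken before clean with one taken after rehydrate.

```python
import copy
import json


class InMemoryStores:
    """Hypothetical stand-in for the Mongo collections."""

    def __init__(self):
        self.messages = {}      # task_id -> list of message dicts
        self.task_states = {}   # task_id -> list of state dicts


def export_task(stores, task_id):
    # Deep copies so the snapshot does not alias live store data.
    return {
        "task_id": task_id,
        "messages": copy.deepcopy(stores.messages.get(task_id, [])),
        "task_states": copy.deepcopy(stores.task_states.get(task_id, [])),
    }


def clean_task(stores, task_id):
    stores.messages.pop(task_id, None)
    stores.task_states.pop(task_id, None)


def rehydrate_task(stores, snapshot):
    # IDs and timestamps are restored exactly as they appear in the snapshot.
    task_id = snapshot["task_id"]
    stores.messages[task_id] = copy.deepcopy(snapshot["messages"])
    stores.task_states[task_id] = copy.deepcopy(snapshot["task_states"])


def canonical_bytes(snapshot):
    # Stable key order and separators make byte comparison meaningful.
    return json.dumps(snapshot, sort_keys=True, separators=(",", ":")).encode("utf-8")


stores = InMemoryStores()
stores.messages["t1"] = [
    {"id": "m1", "task_id": "t1", "created_at": "2024-01-01T00:00:00Z", "content": "hi"}
]
stores.task_states["t1"] = [{"id": "s1", "task_id": "t1", "state": {"step": 3}}]

before = export_task(stores, "t1")
clean_task(stores, "t1")
after_clean = export_task(stores, "t1")
rehydrate_task(stores, before)
after = export_task(stores, "t1")
round_trip_ok = canonical_bytes(before) == canonical_bytes(after)
```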
… task_ids
Two P1 issues from review.
**Authorization (security)**
The three retention endpoints were inheriting only the global auth
middleware, not the resource-level authorization that every other
/tasks/{task_id}/* route enforces. Any authenticated principal could
export, clean, or rehydrate a task they don't own.
Adds DAuthorizedId to all three handlers matching the existing pattern:
- export → AuthorizedOperationType.read
- clean → AuthorizedOperationType.delete
- rehydrate → AuthorizedOperationType.update
**Per-entity task_id validation**
snapshot.task_id was checked against the path task_id, but each embedded
TaskMessageEntity and StateEntity carries its own task_id field that
batch_create forwards straight to MongoDB. A caller could pass
snapshot.task_id = "A" with messages whose task_id = "B" and pollute
task B's collection — Mongo has no FK to reject it.
Adds explicit per-item validation in rehydrate_task before any insert.
Returns 400 with the offending index in the message so the caller can
find the bad entry.
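A rough sketch of this per-item check, assuming the snapshot is a dict with messages and task_states lists of dicts. ClientError and validate_snapshot_task_ids are hypothetical names for illustration; the real rehydrate_task works on entity objects and its message wording may differ.

```python
class ClientError(Exception):
    """Hypothetical stand-in for the error type that maps to HTTP 400."""


def validate_snapshot_task_ids(path_task_id: str, snapshot: dict) -> None:
    """Raise ClientError unless every task_id in the snapshot matches the path."""
    if snapshot.get("task_id") != path_task_id:
        raise ClientError(
            f"snapshot.task_id {snapshot.get('task_id')!r} does not match {path_task_id!r}"
        )
    # Check every embedded entity before any insert is attempted.
    for kind in ("messages", "task_states"):
        for index, item in enumerate(snapshot.get(kind, [])):
            if item.get("task_id") != path_task_id:
                raise ClientError(
                    f"{kind}[{index}].task_id {item.get('task_id')!r} "
                    f"does not match {path_task_id!r}"
                )


good = {
    "task_id": "A",
    "messages": [{"id": "m1", "task_id": "A"}],
    "task_states": [{"id": "s1", "task_id": "A"}],
}
bad = {
    "task_id": "A",
    "messages": [{"id": "m1", "task_id": "A"}, {"id": "m2", "task_id": "B"}],
    "task_states": [],
}

validate_snapshot_task_ids("A", good)  # passes silently
try:
    validate_snapshot_task_ids("A", bad)
    error_message = None
except ClientError as exc:
    error_message = str(exc)
```

Running the whole validation pass before the first insert means a rejected snapshot leaves both stores untouched.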
Tests: 2 new integration tests covering the mismatched-task_id cases
for both messages and task_states. Full suite (15 tests) still passes.
Summary
Adds an operational surface for bounded retention of task chat content in shared infrastructure. Callers can snapshot a task's content, delete it from the shared stores, and later restore it byte-identically — preserving message IDs and timestamps so tool-call and reasoning references between messages stay valid.
New endpoints
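- GET /tasks/{task_id}/export: returns a self-contained snapshot (messages + task_states).
- POST /tasks/{task_id}/clean: deletes Mongo messages, Mongo task_states, Postgres events; resets agent_task_tracker.last_processed_event_id; sets tasks.cleaned_at.
- POST /tasks/{task_id}/rehydrate: restores content from a snapshot; clears cleaned_at.
Design notes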
- TaskRetentionService: both the HTTP routes and the (future) scheduled Temporal sweep workflow will call the same service methods, so the cleanup path is exercised by the same code in both contexts.
- Cleanup order: Mongo deletes (by task_id) first, then Postgres operations, then tasks.cleaned_at last. A retry after partial failure converges because each step is idempotent and cleaned_at is the gate that keeps subsequent runs from re-doing work.
- Guards on clean:
  - Refuses when status == RUNNING (regardless of force=true).
  - Refuses when the task has been idle for less than idle_days (default 7) unless force=true.
  - Refuses when the task has unprocessed events (events_deleted > 0 on an idle-checked task is a signal).
- No audit table: cleanup operations emit structured log lines (task_cleanup_completed, task_rehydrate_completed) with the result payload. Datadog log search is the forensic trail.
- tasks.params is out of scope for v1: not exported, not stripped during cleanup, not restored. If it turns out to carry chat content for specific agents, follow up.
Schema change
A single nullable column on tasks: cleaned_at (TIMESTAMPTZ). This is a metadata-only ALTER (Postgres ≥11 doesn't rewrite the table). It falls within the project's safe-migration shape and passes the migration safety linter.
Other changes
- adapter_mongodb.batch_create now translates pymongo BulkWriteError containing only duplicate-key sub-errors (code 11000) into DuplicateItemError (HTTP 400). Previously it fell through to the generic Exception handler and surfaced as HTTP 500. Narrowly scoped: non-duplicate bulk-write errors still surface as ServiceError.
- EventRepository.delete_by_task_id(task_id) → int
- AgentTaskTrackerRepository.reset_cursors_for_task(task_id) → int
Tests
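13 integration tests in tests/integration/api/task_retention/test_task_retention_api.py covering happy paths, all precondition guards, and the byte-identical export → clean → rehydrate round-trip. Suite runs in ~24s via testcontainers and passes.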
Test plan
- make test FILE=tests/integration/api/task_retention/
- Apply the migration to the tasks table (metadata-only, should be instant)
- Round trip: export → save snapshot → clean → rehydrate with the snapshot → re-export → diff
- Confirm cleaned_at surfaces in GET /tasks/{id} responses
Follow-ups (not in this PR)
- Scheduled workflow that runs clean_task on a daily sweep
- clean is destructive and should require elevated privilege beyond task ownership
- tasks.params content stripping if it proves to carry chat content for any agent
Greptile Summary
This PR introduces a bounded-retention operational surface for task chat content: a GET /export endpoint that snapshots messages and task states, a POST /clean endpoint that deletes them across Mongo and Postgres, and a POST /rehydrate endpoint that restores a snapshot byte-for-byte using caller-preserved IDs. A nullable cleaned_at column is added to the tasks table and is the idempotency gate for the entire flow.
- All three endpoints use DAuthorizedId for resource-level auth (read/delete/update respectively), and rehydrate_task validates per-entity task_id fields before touching either store; both previously flagged gaps are closed.
- adapter_mongodb.batch_create now correctly translates all-duplicate-key BulkWriteError into DuplicateItemError (HTTP 400) instead of falling through to HTTP 500.
- The operation order in clean_task (Mongo deletes → Postgres deletes → cleaned_at last) is carefully chosen so retries after partial failure converge correctly; this is explicitly documented in the docstring.
Confidence Score: 5/5
Safe to merge. The destructive clean endpoint is guarded by status, idle-threshold, and unprocessed-events checks; auth is wired on all three endpoints; and the round-trip invariant is validated by integration tests.
The PR closes all previously-identified gaps (auth via DAuthorizedId, per-entity task_id validation in rehydrate). The cross-database operation ordering is correct and idempotent. The only remaining observation is that idle_days is caller-tunable below the 7-day default, which is a policy question flagged as non-blocking.
No files require special attention beyond the idle_days policy note in the schema.
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["POST /tasks/id/clean"] --> B{"cleaned_at != NULL?"}
    B -- yes --> C["Return empty result / no-op"]
    B -- no --> D{"status == RUNNING?"}
    D -- yes --> E["400 ClientError"]
    D -- no --> F{"enforce_idle AND not idle?"}
    F -- yes --> G["400 ClientError"]
    F -- no --> H{"unprocessed events?"}
    H -- yes --> I["400 ClientError"]
    H -- no --> J["Mongo: delete messages"]
    J --> K["Mongo: delete task_states"]
    K --> L["Postgres: delete events"]
    L --> M["Postgres: reset tracker cursors"]
    M --> N["Postgres: set cleaned_at = now"]
    N --> O["Return TaskCleanupResultEntity"]
    P["POST /tasks/id/rehydrate"] --> R{"task_id mismatch?"}
    R -- yes --> S["400 ClientError"]
    R -- no --> T{"entity task_id mismatch?"}
    T -- yes --> U["400 ClientError"]
    T -- no --> V{"cleaned_at == NULL?"}
    V -- yes --> W["400 ClientError"]
    V -- no --> X["Mongo: batch insert messages"]
    X --> Y["Mongo: batch insert task_states"]
    Y --> Z["Postgres: set cleaned_at = NULL"]
Reviews (2): Last reviewed commit: "address review: per-task authz on retent..."