feat: add per-SP piece cleanup job to bound storage growth #346
Chaitu-Tatipamula wants to merge 9 commits into FilOzone:main
Conversation
Pull request overview
Adds a new “piece cleanup” mechanism to automatically delete old pieces from storage providers when they exceed a configured quota, and integrates it into the pg-boss job scheduler (including optional deal-job gating when over quota).
Changes:
- Introduces PieceCleanupService (plus module and tests) to compute stored bytes per SP and delete the oldest pieces until the SP is back under quota.
- Extends pg-boss scheduling/worker handling with a new per-SP piece_cleanup job type and adds deal-job over-quota gating.
- Adds new env/config knobs and documentation for the quota size, cleanup rate, and cleanup runtime cap.
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| docs/environment-variables.md | Documents new env vars for quota + cleanup scheduling/runtime. |
| apps/backend/src/piece-cleanup/piece-cleanup.service.ts | Implements quota check + cleanup loop and Synapse delete calls. |
| apps/backend/src/piece-cleanup/piece-cleanup.service.spec.ts | Adds unit tests for stored-bytes calc, cleanup looping, abort/idempotency. |
| apps/backend/src/piece-cleanup/piece-cleanup.module.ts | Wires cleanup service with TypeORM Deal repository + Wallet SDK. |
| apps/backend/src/jobs/jobs.service.ts | Adds piece_cleanup job handling, schedules it per SP, and gates deal jobs when over quota. |
| apps/backend/src/jobs/jobs.service.spec.ts | Updates schedule-row expectations and constructor deps for new service/config. |
| apps/backend/src/jobs/jobs.module.ts | Imports PieceCleanupModule so JobsService can use it. |
| apps/backend/src/database/entities/job-schedule-state.entity.ts | Extends JobType union to include piece_cleanup. |
| apps/backend/src/database/entities/deal.entity.ts | Adds cleanedUp / cleanedUpAt fields to track deletions. |
| apps/backend/src/config/app.config.ts | Adds Joi validation + config loading for new piece cleanup env vars. |
Note: we may be delayed looking at this since it's not a priority for the M4.1 mainnet readiness milestone we're working on.
SgtPooki left a comment
@Chaitu-Tatipamula the main concern I have with this approach is that we're using our DB as the source of truth for how much data an SP currently has stored. I think that needs to be inverted. There are two other main concerns below as well.
We should use synapse-sdk / filecoin-pin first to determine the provider’s actual live stored size, and only if the provider is over quota should we query our DB to decide which pieces are eligible for cleanup. The DB is useful for candidate selection and bookkeeping, but it shouldn’t be the thing that decides whether we’re over quota, because it can drift from chain/provider reality.
I’d also want the cleanup path to reconcile DB state after deletion so we don’t keep records around for pieces that are no longer live. That matters not just for quota accounting, but also because those rows are what we use to decide what retrieval work to run. If a piece is already gone from the provider, we should clean up the DB state as part of reconciliation rather than continuing to treat it as stored data.
Concretely, I think the flow should be:
- Ask Synapse/filecoin-pin for the SP’s current live usage.
- If over quota, query the DB for the oldest eligible pieces to delete.
- After each successful delete, update/remove the DB record so retrieval scheduling stays correct.
- Add tests around DB/provider drift so we don’t block deals or schedule retrievals based on stale rows.
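The four steps above can be sketched roughly as follows. Every dependency name here (`getLiveUsageBytes`, `getOldestPieces`, `deletePiece`, `markCleanedUp`) is invented for illustration and does not correspond to a real Synapse/filecoin-pin API:

```typescript
// Sketch of the proposed flow: live usage first, DB only for candidate
// selection and bookkeeping. All dependency names are hypothetical.
interface PieceRow { id: string; sizeBytes: number }

interface CleanupDeps {
  getLiveUsageBytes(spId: string): Promise<number>;                  // step 1: provider is the source of truth
  getOldestPieces(spId: string, limit: number): Promise<PieceRow[]>; // step 2: DB picks deletion candidates
  deletePiece(spId: string, pieceId: string): Promise<void>;         // step 3: delete on the provider
  markCleanedUp(pieceId: string): Promise<void>;                     // step 3: reconcile the DB row
}

async function cleanupIfOverQuota(
  deps: CleanupDeps,
  spId: string,
  quotaBytes: number,
): Promise<number> {
  let usage = await deps.getLiveUsageBytes(spId);
  let deleted = 0;
  while (usage > quotaBytes) {
    const batch = await deps.getOldestPieces(spId, 10);
    if (batch.length === 0) break; // drift: DB has no candidates even though provider is over quota
    for (const piece of batch) {
      if (usage <= quotaBytes) break;
      await deps.deletePiece(spId, piece.id);
      await deps.markCleanedUp(piece.id); // keeps retrieval scheduling from seeing dead rows
      usage -= piece.sizeBytes;
      deleted += 1;
    }
  }
  return deleted;
}
```

Reconciling the DB immediately after each successful delete (rather than in a separate pass) is what keeps retrieval scheduling and quota accounting consistent mid-run.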
I don’t think this should run hourly by default. #281 doesn’t require that cadence, and I’d prefer cleanup to run once per day unless we have evidence we need it to be more aggressive. The current PR models this as runs-per-hour with a default of 1, which feels too frequent for a maintenance/reconciliation job.
I think this needs a second threshold, not just a max quota. When we detect an SP is over quota, we should clean down to a configurable target or create a configurable amount of free headroom, not stop at “barely under quota.” That target needs to be chosen relative to the cleanup cadence and expected growth rate. Otherwise a daily/weekly cleanup can end up either thrashing near the threshold or spending an unbounded amount of time deleting pieces in one run. I’d like to see configurable high-water/low-water behavior here, plus a bounded per-run cleanup budget.
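A minimal sketch of the high-water/low-water behavior being asked for, with illustrative names (`maxBytes` as the trigger threshold, `targetBytes` as the level cleanup drains down to):

```typescript
// High-water mark (maxBytes) triggers cleanup; low-water mark (targetBytes)
// is where cleanup stops, leaving headroom. Names are illustrative only.
function bytesToFree(usageBytes: number, maxBytes: number, targetBytes: number): number {
  if (targetBytes >= maxBytes) throw new Error('low-water mark must sit below the high-water mark');
  if (usageBytes <= maxBytes) return 0; // below high-water: nothing to do
  return usageBytes - targetBytes;      // drain to low-water, creating headroom
}
```

The gap between the two marks is what prevents thrashing: each cleanup run buys roughly `(maxBytes - targetBytes) / growthRate` of time before the next one is needed.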
Will implement using filecoin-pin. I've gone through both libraries and found that filecoin-pin has a purpose-built API for this.
The PieceInfo type also has a status enum (ACTIVE, PENDING_REMOVAL, ONCHAIN_ORPHANED, OFFCHAIN_ORPHANED) which will be useful for the reconciliation concern as well.
This is partially implemented; I will add explicit drift tests. The current implementation already handles reconciliation.
I can add explicit tests covering the DB/provider drift scenario, or would you suggest removing the DB record altogether?
Will change from 1/hour to 1/day.
Will add a configurable low-water mark (TARGET) and document the headroom rationale. New env var: TARGET_DATASET_STORAGE_SIZE_BYTES, defaulted to 20 GiB. When cleanup triggers (live usage exceeds the 24 GiB MAX), it will delete the oldest pieces until usage drops below the 20 GiB TARGET, creating ~4 GiB of headroom. At the current growth rate (4 deals/SP/hour × 10 MiB = ~960 MiB/day), 4 GiB of headroom gives ~4 days before the high-water mark is hit again, which aligns well with a daily cleanup cadence. For the bounded budget: we already have a runtime budget in MAX_PIECE_CLEANUP_RUNTIME_SECONDS (default 300s). Let me know if I am on the right path @SgtPooki!
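As a quick sanity check, the headroom arithmetic above works out (all figures are the ones quoted in the comment, not measured values):

```typescript
// Worked check of the headroom math: MAX = 24 GiB, TARGET = 20 GiB,
// growth = 4 deals/SP/hour at ~10 MiB each.
const MiB = 1024 ** 2;
const GiB = 1024 ** 3;

const growthPerDayBytes = 4 * 24 * 10 * MiB;              // 960 MiB/day
const headroomBytes = 24 * GiB - 20 * GiB;                // MAX minus TARGET = 4 GiB
const daysOfHeadroom = headroomBytes / growthPerDayBytes; // ~4.3 days of runway
```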
…ce cleanup
- Add MAX_DATASET_STORAGE_SIZE_BYTES (24 GiB default), JOB_PIECE_CLEANUP_PER_SP_PER_HOUR (default 1), and MAX_PIECE_CLEANUP_RUNTIME_SECONDS (default 300s) to app.config.ts
- Add IPieceCleanupConfig and extend IJobsConfig with cleanup rate/timeout fields
- Add cleanedUp and cleanedUpAt columns to the Deal entity
- Add piece_cleanup to the JobType union in the job-schedule-state entity
- Add PIECE_CLEANUP_QUEUE constant to job-queues.ts
- Update docs/environment-variables.md with new variable documentation
- Add PieceCleanupService with a while-loop cleanup until under quota
- Implement idempotent deletion (not found / already deleted = success)
- Add a no-progress bail-out guard to prevent infinite loops
- Add isProviderOverQuota() for deal-creation gating
- Add over-quota gating in handleDealJob (skip the deal if the SP is above quota)
- Use jobs.maxPieceCleanupRuntimeSeconds for the timeout in handlePieceCleanupJob
- Add PieceCleanupModule to JobsModule and AppModule imports
- Update jobs.service.spec.ts config mock for the new structure
- Add 23 unit tests covering all edge cases
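The idempotent-deletion rule from this commit ("not found / already deleted = success") could look roughly like this; `NotFoundError` and the injected `deleteOnce` are stand-ins for whatever the Synapse SDK actually throws and exposes:

```typescript
// Idempotent delete: success both when the delete lands and when the piece is
// already gone. A real failure returns false so the caller's no-progress
// guard can bail out instead of looping forever.
class NotFoundError extends Error {}

async function deleteIdempotent(
  deleteOnce: (pieceId: string) => Promise<void>,
  pieceId: string,
): Promise<boolean> {
  try {
    await deleteOnce(pieceId);
    return true;                                   // deleted on this attempt
  } catch (err) {
    if (err instanceof NotFoundError) return true; // already gone: still success
    return false;                                  // real failure: no progress made
  }
}
```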
- Removed the unused `PIECE_CLEANUP_QUEUE` constant; piece cleanup jobs correctly run on the shared `sp.work` singleton queue
- Fixed a timeout leak in `handleDealJob` by moving the SP over-quota check before setting up the `AbortController` timer
- Optimized `cleanupPiecesForProvider` to create a single `StorageContext` before the batch deletion loop
- Drop dataSetId from the deletePiece call
- Align getStoredBytesForProvider with getCleanupCandidates
- Add unit tests for over-quota gating in handleDealJob
…ounter
pieceSize is set atomically with pieceId and DEAL_CREATED status by the upload pipeline; there is no real scenario where a cleanup candidate has pieceSize = 0.
…ot a function" warnings
- Query filecoin-pin's calculateActualStorage() for real-time SP usage instead of relying on DB SUM(piece_size) for quota checks
- Fall back to the DB when the live query is unavailable
- Add high-water (MAX) / low-water (TARGET) thresholds so cleanup deletes down to 20 GiB instead of stopping at the 24 GiB boundary
- Change the default cleanup cadence from hourly to daily
- Exclude cleaned-up deals from retrieval scheduling queries
- Add drift tests for deal blocking and retrieval selection
- Update environment-variables docs
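The live-usage-first check with DB fallback described in this commit can be sketched as follows; calculateActualStorage is named in the commit message, but modeling it as an injected async function (and the DB sum likewise) is an assumption for illustration:

```typescript
// Prefer the provider's live usage; fall back to the DB sum only when the
// live query fails. Both callables are injected placeholders.
async function storedBytesForProvider(
  liveUsage: () => Promise<number>, // e.g. filecoin-pin's calculateActualStorage()
  dbUsage: () => Promise<number>,   // fallback: SUM(piece_size) over Deal rows
): Promise<number> {
  try {
    return await liveUsage();       // provider/chain reality decides over-quota
  } catch {
    return await dbUsage();         // live query unavailable: best-effort DB value
  }
}
```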
Summary
Adds a piece_cleanup job type that periodically deletes the oldest pieces from each SP once total stored data exceeds a configurable quota (MAX_DATASET_STORAGE_SIZE_BYTES, default 24 GiB). Cleanup runs on the existing sp.work queue with singleton-per-SP semantics, ensuring it doesn't overlap with deal/retrieval work.

Closes #281
What it does
- Deletes the oldest pieces first (FIFO) until the SP is back under quota
- Each run is bounded by MAX_PIECE_CLEANUP_RUNTIME_SECONDS (default 300s)
- Scheduled per SP via JOB_PIECE_CLEANUP_PER_SP_PER_HOUR (default 1), consistent with the DEALS_PER_SP_PER_HOUR pattern

Changes
Config & Entity
- New env vars: MAX_DATASET_STORAGE_SIZE_BYTES, JOB_PIECE_CLEANUP_PER_SP_PER_HOUR, MAX_PIECE_CLEANUP_RUNTIME_SECONDS
- Deal entity: new cleaned_up (bool) and cleaned_up_at (timestamptz) columns
- piece_cleanup added to the JobType union

Core Service (piece-cleanup/)
- PieceCleanupService — cleanup loop, quota check, Synapse SDK integration
- isProviderOverQuota() — used by the deal handler to gate new deals

Jobs Integration (jobs/)
- New handlePieceCleanupJob with an AbortController timeout
- Over-quota gating in handleDealJob
- piece_cleanup wired into getIntervalSecondsForRates and ensureScheduleRows
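For context, a per-hour rate knob like JOB_PIECE_CLEANUP_PER_SP_PER_HOUR typically maps to a scheduling interval along these lines. This is a sketch only; the real getIntervalSecondsForRates may differ:

```typescript
// Hypothetical mapping of a jobs-per-hour rate to an interval in seconds,
// in the spirit of the DEALS_PER_SP_PER_HOUR pattern. A fractional rate of
// 1/24 corresponds to the daily piece_cleanup cadence.
function intervalSecondsForRate(jobsPerHour: number): number {
  if (jobsPerHour <= 0) throw new Error('rate must be positive');
  return Math.round(3600 / jobsPerHour);
}
```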
Documentation
- docs/environment-variables.md — 3 new env vars
- docs/jobs.md — piece_cleanup in the job types table, capacity formula
- docs/runbooks/jobs.md — pause/resume/trigger SQL examples
- docs/checks/production-configuration-and-approval-methodology.md — updated FAQ

Testing
- Unit tests in piece-cleanup.service.spec.ts
- Updated jobs.service.spec.ts for the new config structure

Things to be considered