feat: implement instance backup and restore system by hippoley · Pull Request #91 · Yuan-lab-LLM/ClawManager

hippoley · 2026-04-24T11:50:46Z

Instance Backup System

Implements the complete backup system for OpenClaw instances, adding the working code behind the backups and backup_schedules tables that 001_init_schema.sql already defined but nothing used.

Part 1: Manual Backup & Restore

BackupRepository: CRUD operations for the backups table
BackupService: full lifecycle — create, list, get, delete, restore
K8s Job-based backup mechanism: tar.gz archiving of HostPath instance data via busybox Jobs
Async job monitoring with race-condition-safe status updates (markBackupCompleted/markBackupFailed re-read DB state before writing)
TTLSecondsAfterFinished for automatic K8s Job cleanup
BackupHandler: 5 HTTP endpoints (create, list, get, delete, restore)
13 unit tests covering service logic, validation, ownership checks, and edge cases
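
The race-condition-safe status update can be illustrated with a minimal sketch. It is not the code from this PR: `memStore`, its methods, and the trimmed `Backup` struct are hypothetical stand-ins for BackupRepository and the real model, and the status strings are assumed from the description. The point is only the pattern of re-reading the row and skipping the write when the backup has left the creating state (for example because it was soft-deleted in the meantime).

```go
package main

import (
	"fmt"
	"sync"
)

// Backup is a hypothetical, trimmed-down record; the real model has more fields.
type Backup struct {
	ID     int
	Status string // assumed values: "creating", "completed", "failed", "deleted"
}

// memStore is an in-memory stand-in for BackupRepository (an assumption for this sketch).
type memStore struct {
	mu   sync.Mutex
	rows map[int]Backup
}

// Get returns the current row, mimicking a fresh read from the database.
func (s *memStore) Get(id int) (Backup, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	b, ok := s.rows[id]
	return b, ok
}

// SetStatus overwrites the status of an existing row.
func (s *memStore) SetStatus(id int, status string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	b := s.rows[id]
	b.Status = status
	s.rows[id] = b
}

// markCompleted re-reads the row and writes only if the backup is still "creating".
// It reports whether the write happened.
func markCompleted(s *memStore, id int) bool {
	current, ok := s.Get(id)
	if !ok || current.Status != "creating" {
		return false // missing, deleted, or already finished: leave it alone
	}
	s.SetStatus(id, "completed")
	return true
}

func main() {
	s := &memStore{rows: map[int]Backup{
		1: {ID: 1, Status: "creating"},
		2: {ID: 2, Status: "deleted"},
	}}
	fmt.Println(markCompleted(s, 1), markCompleted(s, 2))
}
```

A re-read followed by a separate write narrows the race window but does not close it completely; a conditional update (for example `UPDATE ... WHERE status = 'creating'`) would be one way to make the check and the write atomic.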

Part 2: Scheduled Backups & Expiry Cleanup

BackupScheduleRepository: CRUD + ListAllActive() for the backup_schedules table
BackupScheduler: background loop (60s tick interval) with a minimal cron expression parser
- Supports @hourly, @daily, @weekly, @monthly presets and standard 5-field cron (minute hour dom month dow)
- Field tokens: *, numbers, ranges (1-5), steps (*/6, 1-5/2), lists (1,3,5)
- All times evaluated in UTC, consistent with K8s CronJob behavior
CreateScheduledBackup: system-actor backup creation (no user-ownership check) with auto-computed expires_at
BackupScheduleHandler: 4 HTTP endpoints for schedule CRUD with cron expression and retention_days validation

Risk Mitigations (built-in)

Idempotency guard: 90-second gap check prevents double-fire within the same cron window
Expiry cleanup safety: only soft-deletes backups with status = "completed" AND expires_at < now() — never touches creating status
Retention minimum: retention_days >= 1 enforced at API level
Cron parse failure: gracefully returns false + logs warning, never panics
Panic recovery: tick() has defer recover() so the scheduler goroutine survives unexpected panics
Stop() double-call safety: sync.Once protects close(stopChan) from double-close panic
Tick overlap prevention: sync.Mutex.TryLock() skips a tick if the previous one is still running
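
The three runtime guards above (panic recovery, sync.Once on Stop(), and TryLock on tick) combine roughly as in the following sketch. The `Scheduler` type, its field names, and the `work` hook are assumptions made for illustration, not the PR's BackupScheduler; only `sync.Once`, `sync.Mutex.TryLock` (Go 1.18+), and `recover` are used as the standard library defines them.

```go
package main

import (
	"fmt"
	"sync"
)

// Scheduler is a hypothetical, stripped-down scheduler with the three guards.
type Scheduler struct {
	stopChan chan struct{}
	stopOnce sync.Once
	tickMu   sync.Mutex
	work     func() // hypothetical hook standing in for the real per-tick work
}

func NewScheduler(work func()) *Scheduler {
	return &Scheduler{stopChan: make(chan struct{}), work: work}
}

// tick runs one pass of work. It returns false if the pass was skipped because
// a previous tick still holds the mutex, and true if it acquired the lock,
// even when work panicked and the panic was recovered.
func (s *Scheduler) tick() (ran bool) {
	if !s.tickMu.TryLock() {
		return false // previous tick still running: skip instead of queuing up
	}
	defer s.tickMu.Unlock()
	defer func() {
		if r := recover(); r != nil {
			fmt.Println("scheduler tick recovered:", r)
		}
	}()
	ran = true
	s.work()
	return ran
}

// Stop closes stopChan exactly once, so repeated calls cannot double-close it.
func (s *Scheduler) Stop() {
	s.stopOnce.Do(func() { close(s.stopChan) })
}

func main() {
	s := NewScheduler(func() { panic("boom") })
	fmt.Println("ran:", s.tick()) // the panic is recovered inside tick
	s.Stop()
	s.Stop() // second call is a no-op
}
```

The deferred functions run in reverse order, so the panic is recovered first and the mutex is released afterwards; a recovered tick therefore never leaves the lock held.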

Zero Breaking Changes

All modifications to existing files are pure additions (no existing function bodies modified)
No route conflicts with existing endpoints
DB schema already exists — no new migrations needed
Scheduler integrates with existing graceful shutdown flow

Test Coverage

23 total tests (13 Part 1 + 10 Part 2)
Part 2 tests: cron presets, standard expressions, ranges/steps/lists, invalid expressions, cron validation, scheduler idempotency, cleanup safety, invalid cron skip, double-stop safety, panic recovery
Full regression: all existing tests pass

Implement the backup system whose schema and models were already defined but had no working code. This adds full CRUD operations for manual instance backups with async Kubernetes Job-based backup/restore.

New files:
- repository/backup_repository.go: data access layer (Create, Get, List, Update, Delete, Count)
- services/backup_service.go: business logic with ownership checks, per-instance backup limit (20), async K8s Jobs for tar.gz archiving and restore, soft-delete with async cleanup
- handlers/backup_handler.go: RESTful HTTP handlers wired under /instances/:id/backups
- services/backup_service_test.go: 13 unit tests covering validation, ownership, limits, soft-delete, race-condition defense, and restore preconditions

Modified files:
- cmd/server/main.go: register backup repo, service, handler and routes
- utils/response.go: map backup-specific errors to proper HTTP status codes (400/404)

Key design decisions:
- Backup jobs run as K8s Jobs (busybox) with HostPath volume mounts, matching the existing PVC storage model
- Soft-delete pattern: DeleteBackup marks status='deleted' then launches async file cleanup, preventing data loss on accidental deletion
- Race-condition defense: markBackupCompleted/markBackupFailed re-read DB state before updating, skipping writes if the backup was concurrently soft-deleted
- Restore clears the target directory before extraction to guarantee a clean state
- Nil-guard on pvcService in all async goroutines so unit tests run without a live K8s cluster

Closes: backup system part 1 (manual backup/restore)
Ref: backups table in 001_init_schema.sql


…D API

Part 2 of the instance backup system:
- BackupScheduleRepository: CRUD + ListAllActive for backup_schedules table
- BackupScheduler: background loop (60s tick) with minimal cron parser supporting @hourly/@daily/@weekly/@monthly and standard 5-field expressions
- Idempotency guard: 90s gap check prevents double-fire within same cron window
- Expiry cleanup: soft-deletes only completed backups past expires_at
- CreateScheduledBackup: system-actor backup creation with auto-computed expiry
- BackupScheduleHandler: 4 HTTP endpoints for schedule CRUD with validation
- Runtime hardening: panic recovery in tick, sync.Once for Stop(), mutex for tick overlap prevention
- 10 new tests covering cron parsing, validation, idempotency, cleanup safety, panic recovery, and double-stop protection

hippoley added 2 commits April 27, 2026 11:39

hippoley force-pushed the feat/instance-backup-system branch from c57c70a to 43c1504 Compare April 27, 2026 05:58

hippoley wants to merge 2 commits into Yuan-lab-LLM:main from hippoley:feat/instance-backup-system
