feat: implement instance backup and restore system #91

Open
hippoley wants to merge 2 commits into Yuan-lab-LLM:main from hippoley:feat/instance-backup-system

Conversation

Contributor

@hippoley hippoley commented Apr 24, 2026

Instance Backup System

Implements the complete backup system for OpenClaw instances, adding working code for the backups and backup_schedules tables, which were already defined in 001_init_schema.sql but had no implementation.

Part 1: Manual Backup & Restore

  • BackupRepository: CRUD operations for the backups table
  • BackupService: full lifecycle (create, list, get, delete, restore)
  • K8s Job-based backup mechanism: tar.gz archiving of HostPath instance data via busybox Jobs
  • Async job monitoring with race-condition-safe status updates (markBackupCompleted/markBackupFailed re-read DB state before writing)
  • TTLSecondsAfterFinished for automatic K8s Job cleanup
  • BackupHandler: 5 HTTP endpoints (create, list, get, delete, restore)
  • 13 unit tests covering service logic, validation, ownership checks, and edge cases
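
For reviewers, the re-read-before-write guard described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `memStore`, `newMemStore`, `markCompleted`, and the trimmed `Backup` struct are hypothetical stand-ins for BackupRepository and markBackupCompleted, and an in-memory map replaces the database.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Backup is a hypothetical, trimmed-down model used only for this sketch.
type Backup struct {
	ID     int
	Status string // "creating", "completed", "failed", or "deleted"
}

// memStore stands in for the backups table (assumption: in-memory only).
type memStore struct {
	mu   sync.Mutex
	rows map[int]Backup
}

var errNotFound = errors.New("backup not found")

func newMemStore() *memStore { return &memStore{rows: map[int]Backup{}} }

// markCompleted re-reads the current row and skips the write if the backup
// was soft-deleted in the meantime. It reports whether a write happened.
func (s *memStore) markCompleted(id int) (bool, error) {
	s.mu.Lock()
	defer s.mu.Unlock()

	b, ok := s.rows[id]
	if !ok {
		return false, errNotFound
	}
	if b.Status == "deleted" {
		// A concurrent DeleteBackup won the race; do not resurrect the row.
		return false, nil
	}
	b.Status = "completed"
	s.rows[id] = b
	return true, nil
}

func main() {
	s := newMemStore()
	s.rows[1] = Backup{ID: 1, Status: "creating"}
	s.rows[2] = Backup{ID: 2, Status: "deleted"}
	for _, id := range []int{1, 2} {
		wrote, err := s.markCompleted(id)
		fmt.Println(id, wrote, err, s.rows[id].Status)
	}
}
```

In this sketch the mutex makes the read and the write atomic. Against a real database the same effect would need a transaction or a conditional UPDATE, which is a design choice the sketch does not cover.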

Part 2: Scheduled Backups & Expiry Cleanup

  • BackupScheduleRepository: CRUD + ListAllActive() for the backup_schedules table
  • BackupScheduler: background loop (60s tick interval) with a minimal cron expression parser
    • Supports @hourly, @daily, @weekly, @monthly presets and standard 5-field cron (minute hour dom month dow)
    • Field tokens: *, numbers, ranges (1-5), steps (*/6, 1-5/2), lists (1,3,5)
    • All times evaluated in UTC, consistent with K8s CronJob behavior
  • CreateScheduledBackup: system-actor backup creation (no user-ownership check) with auto-computed expires_at
  • BackupScheduleHandler: 4 HTTP endpoints for schedule CRUD with cron expression and retention_days validation
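
To make the supported cron syntax concrete, here is a rough sketch of a matcher for the presets and field tokens listed above. It is an illustration under assumptions rather than the parser in this PR: `presets`, `fieldMatches`, `matches`, and `utc` are hypothetical names, the preset expansions follow the conventional cron meanings, and day-of-month and day-of-week are simply ANDed (classic cron ORs them when both are restricted).

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"time"
)

// presets expands the shorthand names into 5-field expressions.
var presets = map[string]string{
	"@hourly":  "0 * * * *",
	"@daily":   "0 0 * * *",
	"@weekly":  "0 0 * * 0",
	"@monthly": "0 0 1 * *",
}

// fieldMatches reports whether v satisfies one cron field whose legal
// range is [lower, upper]. It handles *, N, A-B, */S, A-B/S, and lists.
func fieldMatches(field string, v, lower, upper int) (bool, error) {
	matched := false
	for _, part := range strings.Split(field, ",") {
		rng, step := part, 1
		if i := strings.Index(part, "/"); i >= 0 {
			n, err := strconv.Atoi(part[i+1:])
			if err != nil || n <= 0 {
				return false, fmt.Errorf("bad step in %q", part)
			}
			rng, step = part[:i], n
		}
		lo, hi := lower, upper
		switch {
		case rng == "*":
			// Full range; lo and hi are already set.
		case strings.Contains(rng, "-"):
			b := strings.SplitN(rng, "-", 2)
			var e1, e2 error
			lo, e1 = strconv.Atoi(b[0])
			hi, e2 = strconv.Atoi(b[1])
			if e1 != nil || e2 != nil {
				return false, fmt.Errorf("bad range in %q", part)
			}
		default:
			n, err := strconv.Atoi(rng)
			if err != nil {
				return false, fmt.Errorf("bad value in %q", part)
			}
			lo, hi = n, n
		}
		if lo < lower || hi > upper || lo > hi {
			return false, fmt.Errorf("%q outside %d-%d", part, lower, upper)
		}
		if v >= lo && v <= hi && (v-lo)%step == 0 {
			matched = true
		}
	}
	return matched, nil
}

// matches evaluates expr against t in UTC, at minute granularity.
func matches(expr string, t time.Time) (bool, error) {
	if p, ok := presets[expr]; ok {
		expr = p
	}
	f := strings.Fields(expr)
	if len(f) != 5 {
		return false, fmt.Errorf("want 5 fields, got %d", len(f))
	}
	t = t.UTC()
	checks := []struct{ v, lo, hi int }{
		{t.Minute(), 0, 59},
		{t.Hour(), 0, 23},
		{t.Day(), 1, 31},
		{int(t.Month()), 1, 12},
		{int(t.Weekday()), 0, 6},
	}
	all := true
	for i, c := range checks {
		ok, err := fieldMatches(f[i], c.v, c.lo, c.hi)
		if err != nil {
			return false, err
		}
		all = all && ok
	}
	return all, nil
}

// utc is a small helper for building test timestamps.
func utc(y, mo, d, h, mi int) time.Time {
	return time.Date(y, time.Month(mo), d, h, mi, 0, 0, time.UTC)
}

func main() {
	t := utc(2026, 4, 27, 6, 0)
	for _, expr := range []string{"@hourly", "*/6 * * * *", "0 1-5/2 * * *", "bogus"} {
		ok, err := matches(expr, t)
		fmt.Println(expr, ok, err)
	}
}
```

Returning an error instead of panicking on a malformed expression lets the caller log a warning and skip the schedule.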

Risk Mitigations (built-in)

  • Idempotency guard: 90-second gap check prevents double-fire within the same cron window
  • Expiry cleanup safety: only soft-deletes backups with status = "completed" AND expires_at < now(); backups still in the creating status are never touched
  • Retention minimum: retention_days >= 1 enforced at API level
  • Cron parse failure: returns false and logs a warning; never panics
  • Panic recovery: tick() has defer recover() so the scheduler goroutine survives unexpected panics
  • Stop() double-call safety: sync.Once protects close(stopChan) from double-close panic
  • Tick overlap prevention: sync.Mutex.TryLock() skips a tick if the previous one is still running
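
The three runtime-hardening items above (panic recovery, double-Stop safety, tick overlap prevention) fit together roughly as in the sketch below. It is a simplified illustration, not the BackupScheduler from this PR: the `Scheduler` type, its `work` hook, and `NewScheduler` are hypothetical, and `sync.Mutex.TryLock` requires Go 1.18 or later.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Scheduler is a hypothetical, stripped-down scheduler for illustration.
type Scheduler struct {
	stopChan chan struct{}
	stopOnce sync.Once
	tickMu   sync.Mutex
	work     func() // per-tick job, e.g. fire due schedules and clean up expired backups
}

func NewScheduler(work func()) *Scheduler {
	return &Scheduler{stopChan: make(chan struct{}), work: work}
}

// tick runs one pass. It returns false when the pass was skipped because
// the previous one is still running.
func (s *Scheduler) tick() (ran bool) {
	if !s.tickMu.TryLock() {
		return false // previous tick still in progress
	}
	defer s.tickMu.Unlock()
	defer func() {
		// Keep the scheduler goroutine alive if work panics.
		if r := recover(); r != nil {
			fmt.Println("scheduler: recovered from panic:", r)
		}
	}()
	ran = true
	s.work()
	return
}

// Run ticks at the given interval until Stop is called.
func (s *Scheduler) Run(interval time.Duration) {
	t := time.NewTicker(interval)
	defer t.Stop()
	for {
		select {
		case <-t.C:
			s.tick()
		case <-s.stopChan:
			return
		}
	}
}

// Stop is safe to call more than once; the channel is closed exactly once.
func (s *Scheduler) Stop() {
	s.stopOnce.Do(func() { close(s.stopChan) })
}

func main() {
	s := NewScheduler(func() { panic("boom") })
	fmt.Println("tick ran:", s.tick())
	s.Stop()
	s.Stop() // second call is a no-op
	fmt.Println("stopped twice without panicking")
}
```

Because the recover is deferred after the unlock, it runs first, and the mutex is still released when `work` panics, so the next tick is not blocked.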

Zero Breaking Changes

  • All modifications to existing files are pure additions (no existing function bodies modified)
  • No route conflicts with existing endpoints
  • DB schema already exists — no new migrations needed
  • Scheduler integrates with existing graceful shutdown flow

Test Coverage

  • 23 total tests (13 Part 1 + 10 Part 2)
  • Part 2 tests: cron presets, standard expressions, ranges/steps/lists, invalid expressions, cron validation, scheduler idempotency, cleanup safety, invalid cron skip, double-stop safety, panic recovery
  • Full regression: all existing tests pass

Implement the backup system whose schema and models were already defined
but had no working code. This adds full CRUD operations for manual
instance backups with async Kubernetes Job-based backup/restore.

New files:
- repository/backup_repository.go: data access layer (Create, Get, List,
  Update, Delete, Count)
- services/backup_service.go: business logic with ownership checks,
  per-instance backup limit (20), async K8s Jobs for tar.gz archiving
  and restore, soft-delete with async cleanup
- handlers/backup_handler.go: RESTful HTTP handlers wired under
  /instances/:id/backups
- services/backup_service_test.go: 13 unit tests covering validation,
  ownership, limits, soft-delete, race-condition defense, and restore
  preconditions

Modified files:
- cmd/server/main.go: register backup repo, service, handler and routes
- utils/response.go: map backup-specific errors to proper HTTP status
  codes (400/404)
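
A rough sketch of the kind of error-to-status mapping described for utils/response.go follows. The sentinel errors and the `statusFor` function are hypothetical names chosen for illustration; the actual error values and the exact 400/404 split in this PR may differ.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// Hypothetical sentinel errors; the real names live in the service package.
var (
	ErrBackupNotFound      = errors.New("backup not found")
	ErrBackupLimitReached  = errors.New("backup limit reached")
	ErrBackupNotRestorable = errors.New("backup is not in a restorable state")
)

// statusFor maps a service error to an HTTP status code.
// errors.Is is used so that wrapped errors are still recognized.
func statusFor(err error) int {
	switch {
	case err == nil:
		return http.StatusOK
	case errors.Is(err, ErrBackupNotFound):
		return http.StatusNotFound
	case errors.Is(err, ErrBackupLimitReached),
		errors.Is(err, ErrBackupNotRestorable):
		return http.StatusBadRequest
	default:
		return http.StatusInternalServerError
	}
}

func main() {
	wrapped := fmt.Errorf("restore: %w", ErrBackupNotFound)
	fmt.Println(statusFor(wrapped))
	fmt.Println(statusFor(ErrBackupLimitReached))
}
```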

Key design decisions:
- Backup jobs run as K8s Jobs (busybox) with HostPath volume mounts,
  matching the existing PVC storage model
- Soft-delete pattern: DeleteBackup marks status='deleted' then launches
  async file cleanup, preventing data loss on accidental deletion
- Race-condition defense: markBackupCompleted/markBackupFailed re-read
  DB state before updating, skipping writes if the backup was
  concurrently soft-deleted
- Restore clears the target directory before extraction to guarantee a
  clean state
- Nil-guard on pvcService in all async goroutines so unit tests run
  without a live K8s cluster

Closes: backup system part 1 (manual backup/restore)
Ref: backups table in 001_init_schema.sql
…D API

Part 2 of the instance backup system:

- BackupScheduleRepository: CRUD + ListAllActive for backup_schedules table
- BackupScheduler: background loop (60s tick) with minimal cron parser
  supporting @hourly/@daily/@weekly/@monthly and standard 5-field expressions
- Idempotency guard: 90s gap check prevents double-fire within same cron window
- Expiry cleanup: soft-deletes only completed backups past expires_at
- CreateScheduledBackup: system-actor backup creation with auto-computed expiry
- BackupScheduleHandler: 4 HTTP endpoints for schedule CRUD with validation
- Runtime hardening: panic recovery in tick, sync.Once for Stop(), mutex for
  tick overlap prevention
- 10 new tests covering cron parsing, validation, idempotency, cleanup safety,
  panic recovery, and double-stop protection
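
The idempotency guard and the expiry-cleanup filter from the list above can be sketched as two small predicates. This is an illustration under assumptions: `shouldFire`, `isExpired`, `expiresAt`, `neverRan`, and `ts` are hypothetical names, and computing expires_at as creation time plus retention_days is a guess at what "auto-computed" means here.

```go
package main

import (
	"fmt"
	"time"
)

// minFireGap is the 90-second window inside which a schedule may not fire twice.
const minFireGap = 90 * time.Second

// neverRan is the zero time, meaning the schedule has not fired yet.
var neverRan time.Time

// shouldFire reports whether a schedule whose cron expression matches the
// current minute should actually create a backup now.
func shouldFire(cronMatches bool, lastRun, now time.Time) bool {
	if !cronMatches {
		return false
	}
	return lastRun.IsZero() || now.Sub(lastRun) >= minFireGap
}

// Backup is a hypothetical, trimmed-down model used only for this sketch.
type Backup struct {
	Status    string
	ExpiresAt time.Time
}

// isExpired selects only completed backups whose expiry has passed, so a
// backup that is still being created is never cleaned up.
func isExpired(b Backup, now time.Time) bool {
	return b.Status == "completed" && !b.ExpiresAt.IsZero() && b.ExpiresAt.Before(now)
}

// expiresAt derives the expiry from the retention period (assumption).
func expiresAt(created time.Time, retentionDays int) time.Time {
	return created.AddDate(0, 0, retentionDays)
}

// ts builds a UTC timestamp from Unix seconds, for examples and tests.
func ts(sec int64) time.Time { return time.Unix(sec, 0).UTC() }

func main() {
	now := ts(1000000)
	fmt.Println(shouldFire(true, ts(1000000-60), now))
	fmt.Println(shouldFire(true, ts(1000000-90), now))
	fmt.Println(isExpired(Backup{Status: "creating", ExpiresAt: ts(999999)}, now))
	fmt.Println(expiresAt(now, 7).Sub(now))
}
```

With a 60-second tick and minute-granularity cron matching, a 90-second gap is longer than one tick but shorter than two, so jittered ticks in the same minute are suppressed while the next scheduled minute is not.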
@hippoley hippoley force-pushed the feat/instance-backup-system branch from c57c70a to 43c1504 on April 27, 2026 05:58