Conversation

zfarrell commented Jan 16, 2026

Summary

  • Replaces /connections/{id}/discover with unified POST /refresh for schema + data refresh
  • Adds versioned cache writes (atomic swaps) and deferred deletion with a background worker (see the sketch after this list)
  • Adds refresh warnings and better error mapping for not-found tables
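
For orientation, here is a minimal sketch of the versioned-write idea: new data lands in a fresh version-stamped directory and only becomes visible once a "current" pointer is swapped atomically, leaving the old version to be deleted later. The name CacheWriteHandle mirrors the PR, but the fields, paths, and pointer-file scheme below are illustrative assumptions, not the actual implementation.

use std::fs;
use std::path::{Path, PathBuf};
use std::time::{SystemTime, UNIX_EPOCH};

/// Illustrative stand-in for the PR's CacheWriteHandle: data is written into a
/// fresh versioned directory and only made visible once `commit` succeeds.
struct CacheWriteHandle {
    table_root: PathBuf,  // e.g. cache/my_postgres/public/orders
    version_dir: PathBuf, // e.g. cache/my_postgres/public/orders/v1737057600
}

impl CacheWriteHandle {
    fn begin(table_root: &Path) -> std::io::Result<Self> {
        let version = SystemTime::now().duration_since(UNIX_EPOCH).unwrap().as_secs();
        let version_dir = table_root.join(format!("v{version}"));
        fs::create_dir_all(&version_dir)?;
        Ok(Self { table_root: table_root.to_path_buf(), version_dir })
    }

    /// Where the refresh writes the new parquet file(s).
    fn data_path(&self) -> PathBuf {
        self.version_dir.join("data.parquet")
    }

    /// Atomically repoint "current" to the new version and return the previous
    /// version directory (if any) so it can be scheduled for deferred deletion.
    fn commit(self) -> std::io::Result<Option<PathBuf>> {
        let pointer = self.table_root.join("CURRENT");
        let previous = fs::read_to_string(&pointer).ok().map(PathBuf::from);
        let tmp = self.table_root.join("CURRENT.tmp");
        fs::write(&tmp, self.version_dir.to_string_lossy().as_bytes())?;
        fs::rename(&tmp, &pointer)?; // rename is atomic on the same filesystem
        Ok(previous)
    }
}

fn main() -> std::io::Result<()> {
    let handle = CacheWriteHandle::begin(Path::new("cache/my_postgres/public/orders"))?;
    fs::write(handle.data_path(), b"...parquet bytes...")?;
    if let Some(old_version) = handle.commit()? {
        println!("schedule deferred deletion of {}", old_version.display());
    }
    Ok(())
}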

Details

  • New refresh request/response models with optional warnings
  • Storage refactor: versioned directories + cache write handle (filesystem + S3)
  • Catalog support for pending deletions with retry tracking (see the sketch after this list)
  • Extensive new tests for refresh behavior, storage, and deletion flow
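
A rough sketch of the deferred-deletion flow: each pending_deletions record carries a retry_count, and the background worker drops the record after MAX_DELETION_RETRIES (5) failed attempts, as described in the commit notes below. The Catalog and Storage traits here are illustrative stand-ins, not the crate's real interfaces.

use std::path::PathBuf;
use std::time::Duration;

const MAX_DELETION_RETRIES: u32 = 5; // after this many failures the record is dropped

/// One row of the pending_deletions table (illustrative shape).
struct PendingDeletion {
    id: i64,
    path: PathBuf,
    retry_count: u32,
}

/// Minimal interfaces the worker needs from the catalog and storage layers.
trait Catalog {
    fn get_pending_deletions(&self) -> Vec<PendingDeletion>;
    fn remove_deletion(&self, id: i64);
    fn bump_retry(&self, id: i64);
}

trait Storage {
    fn delete_prefix(&self, path: &PathBuf) -> Result<(), String>;
}

/// One pass of the background worker: try each due deletion, retry later on
/// failure, and give up (dropping the record) once the retry cap is reached.
fn run_deletion_pass(catalog: &dyn Catalog, storage: &dyn Storage) {
    for pending in catalog.get_pending_deletions() {
        match storage.delete_prefix(&pending.path) {
            Ok(()) => catalog.remove_deletion(pending.id),
            Err(e) if pending.retry_count + 1 >= MAX_DELETION_RETRIES => {
                eprintln!("giving up on {}: {e}", pending.path.display());
                catalog.remove_deletion(pending.id);
            }
            Err(_) => catalog.bump_retry(pending.id),
        }
    }
}

fn main() {
    // In the service this pass runs on a configurable interval, e.g.:
    let _interval = Duration::from_secs(60);
    // loop { run_deletion_pass(&catalog, &storage); std::thread::sleep(_interval); }
}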

Breaking Changes

  • Removes /connections/{id}/discover; use POST /refresh instead

API

The /refresh endpoint provides a unified interface for refreshing schema metadata and cached data. It supports multiple operation modes depending on the parameters provided.

Endpoint

POST /refresh
Content-Type: application/json

Request Parameters

Parameter        | Type    | Required | Description
connection_id    | string  | No       | External connection ID to target. If omitted, operates on all connections.
schema_name      | string  | No       | Database schema name. Required when targeting a specific table.
table_name       | string  | No       | Table name. Requires schema_name and connection_id.
data             | boolean | No       | If true, refreshes cached parquet data. If false (default), refreshes schema metadata only.
include_uncached | boolean | No       | When data=true on a connection-wide refresh, also sync tables that aren't already cached. Default: false.

Operation Modes

1. Schema Refresh: All Connections

Discovers tables from all configured connections and updates the catalog.

POST /refresh
{}

Response:

{
  "connections_refreshed": 2,
  "connections_failed": 0,
  "tables_discovered": 15,
  "tables_added": 3,
  "tables_modified": 1
}

If any connections fail (e.g., due to missing credentials), the operation continues with remaining connections and reports failures:

{
  "connections_refreshed": 1,
  "connections_failed": 1,
  "tables_discovered": 10,
  "tables_added": 2,
  "tables_modified": 0,
  "errors": [
    {
      "connection_id": "broken_conn",
      "error": "Failed to connect: missing credentials"
    }
  ]
}

2. Schema Refresh: Single Connection

Discovers tables from a specific connection.

POST /refresh
{
  "connection_id": "my_postgres"
}

Response:

{
  "connections_refreshed": 1,
  "connections_failed": 0,
  "tables_discovered": 8,
  "tables_added": 1,
  "tables_modified": 0
}

3. Data Refresh: Single Table

Re-fetches data for a specific table and updates the parquet cache.

POST /refresh
{
  "connection_id": "my_postgres",
  "schema_name": "public",
  "table_name": "orders",
  "data": true
}

Response:

{
  "connection_id": "my_postgres",
  "schema_name": "public",
  "table_name": "orders",
  "rows_synced": 15000,
  "duration_ms": 1234
}

Returns 404 if the table doesn't exist in the catalog.

4. Data Refresh: All Tables in Connection

Re-fetches data for all cached tables in a connection.

POST /refresh
{
  "connection_id": "my_postgres",
  "data": true
}

Response:

{
  "connection_id": "my_postgres",
  "tables_refreshed": 5,
  "tables_failed": 1,
  "total_rows": 50000,
  "duration_ms": 5678,
  "errors": [
    {
      "schema_name": "public",
      "table_name": "broken_table",
      "error": "Query failed: relation does not exist"
    }
  ]
}

By default, only tables that already have cached data are refreshed. To also sync uncached tables:

POST /refresh
{
  "connection_id": "my_postgres",
  "data": true,
  "include_uncached": true
}

Warnings

Non-fatal issues (like failed cleanup of old cache files) are reported in a warnings array without failing the operation:

{
  "connection_id": "my_postgres",
  "schema_name": "public",
  "table_name": "orders",
  "rows_synced": 15000,
  "duration_ms": 1234,
  "warnings": [
    {
      "schema_name": "public",
      "table_name": "orders",
      "message": "Failed to schedule deletion of old cache: disk full"
    }
  ]
}

The warnings field is omitted when empty.
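
One common way to get this "omitted when empty" behavior in a Rust/serde response model is a skip_serializing_if attribute. The sketch below is an assumption about how the types might look (field names copied from the example above; serde and serde_json as dependencies), not the PR's actual models.

use serde::Serialize;

#[derive(Serialize)]
struct RefreshWarning {
    schema_name: String,
    table_name: String,
    message: String,
}

#[derive(Serialize)]
struct TableRefreshResponse {
    connection_id: String,
    schema_name: String,
    table_name: String,
    rows_synced: u64,
    duration_ms: u64,
    // Leave `warnings` out of the JSON entirely when there are none.
    #[serde(skip_serializing_if = "Vec::is_empty")]
    warnings: Vec<RefreshWarning>,
}

fn main() {
    let resp = TableRefreshResponse {
        connection_id: "my_postgres".into(),
        schema_name: "public".into(),
        table_name: "orders".into(),
        rows_synced: 15000,
        duration_ms: 1234,
        warnings: Vec::new(),
    };
    // Prints the response without a "warnings" key.
    println!("{}", serde_json::to_string_pretty(&resp).unwrap());
}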

Invalid Combinations

Request                                       | Error
data=true without connection_id               | 400 Bad Request - data refresh requires connection_id
schema_name without table_name (data=false)   | 400 Bad Request - schema-level refresh not supported
schema_name without table_name (data=true)    | 400 Bad Request - data refresh with schema requires table_name
table_name without schema_name                | 400 Bad Request - table_name requires schema_name
schema_name or table_name with schema refresh | 400 Bad Request - schema refresh cannot target specific tables
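
As a rough illustration, the combinations above could be enforced by a single validation pass before dispatching the request. The RefreshRequest shape follows the documented parameters, but the function and the check ordering are assumptions, not the handler's actual code.

/// Mirrors the request parameters documented above.
struct RefreshRequest {
    connection_id: Option<String>,
    schema_name: Option<String>,
    table_name: Option<String>,
    data: bool,
    include_uncached: bool,
}

/// Returns the 400 error message for an invalid combination, per the table above.
fn validate(req: &RefreshRequest) -> Result<(), &'static str> {
    if req.table_name.is_some() && req.schema_name.is_none() {
        return Err("table_name requires schema_name");
    }
    if req.data && req.connection_id.is_none() {
        return Err("data refresh requires connection_id");
    }
    if req.schema_name.is_some() && req.table_name.is_none() {
        return Err(if req.data {
            "data refresh with schema requires table_name"
        } else {
            "schema-level refresh not supported"
        });
    }
    if !req.data && (req.schema_name.is_some() || req.table_name.is_some()) {
        return Err("schema refresh cannot target specific tables");
    }
    Ok(())
}

fn main() {
    let req = RefreshRequest {
        connection_id: None,
        schema_name: None,
        table_name: None,
        data: true,
        include_uncached: false,
    };
    // data=true without connection_id -> expect the 400 message from the table above
    println!("{:?}", validate(&req));
}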

Error Responses

  • 400 Bad Request - Invalid parameter combination
  • 404 Not Found - Connection or table not found
  • 500 Internal Server Error - Unexpected server error

Replace tests using the removed /discover endpoint with tests for the new /refresh endpoint. Remove unused discover_connection_handler and DiscoverConnectionResponse.

- Consolidate prepare_cache_write to always use versioned directories
- Add CacheWriteHandle struct for atomic data refresh operations
- Remove delete_stale_tables (too error-prone, doesn't handle cleanup)
- Rename get_due_deletions to get_pending_deletions
- Add include_uncached flag to connection data refresh (default: false)

Resolve conflicts from origin/main's new migration system refactor. Add v2.sql and v3.sql migrations for pending_deletions table.

- Add warning log on timestamp parse failure instead of silent fallback
- Defer deletion by 24h on parse error to prevent premature data loss
- Expand include_uncached documentation with usage context
- Add comment explaining temp_dir lifetime in test harness
- Add retry_count to pending_deletions for stuck deletion handling
- Remove records after MAX_DELETION_RETRIES (5) failed attempts
- Make deletion worker interval configurable via builder
- Make parallel refresh count configurable via builder
- Extract magic numbers to named constants
- Handle S3 temp file removal failure gracefully with warning

S3Storage::delete now properly handles directory URLs by listing and deleting all objects with that prefix, matching FilesystemStorage behavior. Migrated S3 tests to use testcontainers for automatic MinIO orchestration.

The tables_removed field was misleading - it reported tables detected as removed from the remote source but didn't actually delete them from the catalog. Removed the field entirely to avoid confusing API consumers.

All call sites that delete versioned cache directories now use delete_prefix instead of delete. Reverted S3Storage::delete to only handle single files. This makes the API clearer: delete for files, delete_prefix for directories/prefixes.
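
To make that last point concrete, here is a minimal sketch of the delete vs. delete_prefix split, assuming a shared Storage trait implemented by both backends. The signatures and the filesystem implementation are illustrative, not the crate's actual API; the S3 side is only described in a comment.

/// Illustrative split between single-object and prefix deletion described above.
trait Storage {
    /// Delete exactly one object/file. Directories and prefixes are not accepted.
    fn delete(&self, key: &str) -> Result<(), String>;

    /// Delete every object under a prefix, e.g. a versioned cache directory
    /// such as "my_postgres/public/orders/v1737057600/".
    fn delete_prefix(&self, prefix: &str) -> Result<(), String>;
}

struct FilesystemStorage { root: std::path::PathBuf }

impl Storage for FilesystemStorage {
    fn delete(&self, key: &str) -> Result<(), String> {
        std::fs::remove_file(self.root.join(key)).map_err(|e| e.to_string())
    }
    fn delete_prefix(&self, prefix: &str) -> Result<(), String> {
        // remove_dir_all plays the role of "list and delete everything under the prefix".
        std::fs::remove_dir_all(self.root.join(prefix)).map_err(|e| e.to_string())
    }
}

// An S3-backed implementation would delete a single key in `delete` and
// list-then-delete all keys under the prefix in `delete_prefix`.

fn main() {
    let storage = FilesystemStorage { root: std::path::PathBuf::from("cache") };
    // Single parquet file vs. a whole versioned directory:
    let _ = storage.delete("my_postgres/public/orders/v1/data.parquet");
    let _ = storage.delete_prefix("my_postgres/public/orders/v1");
}
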
zfarrell marked this pull request as ready for review on January 16, 2026 at 21:37.
zfarrell merged commit 3206eb7 into main on Jan 16, 2026.
6 checks passed.