Support persist sandbox metadata to database #730

Open
zhangjaycee wants to merge 13 commits into alibaba:master from zhangjaycee:feature/db
Conversation

@zhangjaycee
Collaborator

close #729

Add DatabaseConfig dataclass (url field) to rock/config.py and wire it
into RockConfig both as a field and in the from_env() YAML parser.
- Add Base(DeclarativeBase) as the single SQLAlchemy declarative base
- Add SandboxRecord ORM model with all sandbox metadata columns
- Add LIST_BY_ALLOWLIST and _NOT_NULL_DEFAULTS class-level constants
- Add DatabaseProvider with async engine/session factory
- Add DatabaseConfig dataclass to RockConfig
- _convert_url handles sqlite://, postgresql://, and postgres:// (Heroku)
  shorthand; URLs with existing driver specifier pass through unchanged
- Default state column value uses string literal "pending" instead of
  State.PENDING enum instance for explicit column semantics
- Add SandboxTable with insert/get/update/delete/list_by/list_by_in
- _filter_data strips unknown keys; _NOT_NULL_DEFAULTS fills NOT NULL cols
- LIST_BY_ALLOWLIST prevents arbitrary column queries (injection guard)
- _record_to_sandbox_info uses lru_cache to avoid repeated get_type_hints
  calls in bulk list_by scenarios
- Add SandboxInfoField generated type and generation script
- Redis alive/timeout keys remain the source of truth for live state
- DB writes are fire-and-forget via asyncio.create_task + _safe_db_call
- batch_get: Redis hits served directly; DB fallback uses a single
  list_by_in("sandbox_id", miss_ids) query instead of N serial gets,
  so each row is resolved through the primary key index
- iter_alive_sandbox_ids queries DB by state IN (running, pending)
  instead of Redis scan_iter, enabling indexed filtering
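
The batch_get fallback described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `InMemoryRedis`, `InMemorySandboxTable`, and the `mget` helper are hypothetical in-memory stand-ins, and returning a dict keyed by sandbox id is an assumption made for brevity.

```python
import asyncio
from typing import Optional


class InMemoryRedis:
    """Hypothetical stand-in for the Redis hot path."""

    def __init__(self, data: dict) -> None:
        self._data = data

    async def mget(self, ids: list) -> list:
        # One entry per requested id; None marks a cache miss.
        return [self._data.get(sandbox_id) for sandbox_id in ids]


class InMemorySandboxTable:
    """Hypothetical stand-in for SandboxTable that counts queries."""

    def __init__(self, rows: dict) -> None:
        self._rows = rows
        self.query_count = 0

    async def list_by_in(self, column: str, values: list) -> list:
        self.query_count += 1
        return [row for row in self._rows.values() if row[column] in values]


async def batch_get(redis: InMemoryRedis, table: InMemorySandboxTable, ids: list) -> dict:
    cached = await redis.mget(ids)
    found = {sid: info for sid, info in zip(ids, cached) if info is not None}
    miss_ids = [sid for sid in ids if sid not in found]
    if miss_ids:
        # A single IN (...) query covers every miss instead of one get() per id.
        for row in await table.list_by_in("sandbox_id", miss_ids):
            found[row["sandbox_id"]] = row
    return found
```

With one id in Redis, one only in the DB, and one unknown, the sketch issues exactly one DB query and omits the unknown id from the result.
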
…e to meta_repo

- Replace MetaStore with SandboxRepository throughout SandboxManager,
  GemManager, BaseManager, and SandboxProxyService
- Wire SandboxRepository (Redis + SandboxTable) in admin/main.py startup
- stop(): add early return after archive() in the ValueError except branch
  to prevent double archive when the Ray actor is already gone
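
The stop() fix above comes down to control flow. A reduced, synchronous sketch of the idea follows; the class name, `_get_actor`, and the call counter are hypothetical, and the real method presumably does more work on the normal path.

```python
class StopFlowSketch:
    """Hypothetical reduction of stop(); only the branch structure matters."""

    def __init__(self, actor_alive: bool) -> None:
        self._actor_alive = actor_alive
        self.archive_calls = 0

    def _get_actor(self) -> str:
        # Mimics the lookup failing when the Ray actor is already gone.
        if not self._actor_alive:
            raise ValueError("actor not found")
        return "actor"

    def archive(self) -> None:
        self.archive_calls += 1

    def stop(self) -> None:
        try:
            self._get_actor()
        except ValueError:
            self.archive()
            return  # without this return, the archive() below would run again
        # Normal path: the actor would be stopped here, then archived once.
        self.archive()
```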

Made-with: Cursor
- Add TestSandboxTableWithSQLite: full CRUD coverage using SQLite
  in-memory database (no external dependencies, runs in fast CI)
  including list_by_in, NOT NULL defaults, and noop-on-missing-id cases
- Add TestSandboxTableWithPostgres: PostgreSQL-specific tests (JSONB,
  real container) marked need_docker + need_database
- Add comprehensive SandboxRepository tests: create/update/delete/archive/
  get/exists/batch_get/list_by/refresh_timeout/is_expired
- Consistent lowercase "stopped" state string throughout test data,
  matching the State enum value convention (running/pending)
- Add single-column indexes on all commonly queried fields (user_id,
  state, namespace, experiment_id, cluster_name, image, host_ip,
  host_name, create_user_gray_flag)
- Add scripts/gen_ddl.py to emit CREATE TABLE / CREATE INDEX DDL
- Add *.db and ddl/ to .gitignore (generated artifacts)
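
To illustrate what the single-column indexes and the emitted DDL amount to, here is a sketch that uses only the standard-library sqlite3 module rather than the SQLAlchemy models in the PR. The table name `sandbox_record`, the `ix_` naming scheme, and the uniform TEXT column type are assumptions; only the column names come from the commit message.

```python
import sqlite3

TABLE = "sandbox_record"  # assumed table name
INDEXED_COLUMNS = [
    "user_id", "state", "namespace", "experiment_id", "cluster_name",
    "image", "host_ip", "host_name", "create_user_gray_flag",
]


def build_ddl() -> list:
    """Return one CREATE TABLE statement followed by one CREATE INDEX per column."""
    columns = ", ".join(f"{name} TEXT" for name in INDEXED_COLUMNS)
    statements = [f"CREATE TABLE {TABLE} (sandbox_id TEXT PRIMARY KEY, {columns})"]
    statements += [
        f"CREATE INDEX ix_{TABLE}_{name} ON {TABLE} ({name})"
        for name in INDEXED_COLUMNS
    ]
    return statements


def apply_ddl(conn: sqlite3.Connection) -> list:
    """Run the DDL and return the names of the explicitly created indexes."""
    for statement in build_ddl():
        conn.execute(statement)
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'index' AND name LIKE 'ix%'"
    ).fetchall()
    return sorted(row[0] for row in rows)
```

Running `apply_ddl(sqlite3.connect(":memory:"))` mirrors the in-memory SQLite setup the tests rely on, with no external database needed.
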
OperatorContext was missing redis_provider, leaving RayOperator._redis_provider
as None. This caused the use_rocklet get_status path to crash with
'NoneType object has no attribute get' because build_sandbox_from_redis
skips the lookup entirely when redis_provider is None.
- Rename class SandboxRepository to SandboxMetaStore to better reflect its role
  as a coordinator for Redis (hot path) + DB (query path) dual-write
- Rename _meta_repo to _meta_store across all files
- Rename sandbox_repository.py to sandbox_meta_store.py
- Update all imports and references
- Use legacy states (_TERMINAL_STATES, _LIST_BY_BLACKLIST) from SandboxMetaStore
  as the authoritative source; removed duplicate definitions elsewhere
- Add rock/sandbox/utils/timeout.py with SandboxTimeoutHelper:
  pure calculation helpers (make_timeout_info, refresh_timeout, is_expired)
  with no I/O dependency
- Add SandboxMetaStore.update_timeout() for raw Redis set of timeout key;
  remove refresh_timeout() and is_expired() from MetaStore (not its responsibility)
- SandboxManager._refresh_timeout / _is_expired: own the I/O (get_timeout +
  update_timeout) and delegate calculation to SandboxTimeoutHelper
- SandboxProxyService._update_expire_time: same pattern
- Replace inline auto_clear_time_dict construction in start_async with
  SandboxTimeoutHelper.make_timeout_info()
- Update tests: replace TestRefreshTimeout/TestIsExpired in test_sandbox_meta_store
  with TestUpdateTimeout; add test_sandbox_timeout.py for pure unit tests
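
A pure-calculation helper along the lines described above might look like the following. The method names match the commit message, but the signatures, the `auto_clear_seconds` and `expire_time` field names, and the injectable `now` argument are assumptions for this sketch.

```python
import time
from typing import Optional


class SandboxTimeoutHelper:
    """Pure calculations only; callers own the Redis reads and writes."""

    @staticmethod
    def make_timeout_info(auto_clear_seconds: int, now: Optional[float] = None) -> dict:
        now = time.time() if now is None else now
        return {
            "auto_clear_seconds": auto_clear_seconds,
            "expire_time": now + auto_clear_seconds,
        }

    @staticmethod
    def refresh_timeout(info: dict, now: Optional[float] = None) -> dict:
        now = time.time() if now is None else now
        # Returns a new dict so the caller decides when to persist it.
        return {**info, "expire_time": now + info["auto_clear_seconds"]}

    @staticmethod
    def is_expired(info: dict, now: Optional[float] = None) -> bool:
        now = time.time() if now is None else now
        return now >= info["expire_time"]
```

Because nothing here touches Redis, the helper can be unit-tested with fixed timestamps, which is the point of splitting it from SandboxMetaStore.
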
Schema
- Add spec (JSONB): DockerDeploymentConfig.model_dump() snapshot, written
  once at creation, never updated
- Add status (JSONB): full SandboxInfo snapshot, overwritten on every update

SandboxRecordData
- New TypedDict in schema.py extending SandboxInfo with spec/status fields
- Used as the unified I/O type for SandboxTable (replaces plain SandboxInfo)

SandboxTable
- create(): writes spec from caller; auto-populates status from data
- update(): always overwrites status with latest SandboxInfo snapshot
- list_by / list_by_in / get return SandboxRecordData
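
The spec/status semantics above (spec written once, status overwritten on every update) can be sketched with a dict-backed table. This is an illustration under assumptions: `InMemorySandboxTable` is hypothetical, and whether status reflects the merged row or only the latest payload is a guess.

```python
import copy
from typing import Optional


class InMemorySandboxTable:
    """Hypothetical dict-backed stand-in for SandboxTable."""

    def __init__(self) -> None:
        self._rows = {}

    def create(self, sandbox_id: str, data: dict, spec: Optional[dict] = None) -> None:
        # spec is stored once at creation; status starts as a snapshot of data.
        self._rows[sandbox_id] = {**data, "spec": copy.deepcopy(spec), "status": dict(data)}

    def update(self, sandbox_id: str, data: dict) -> None:
        row = self._rows.get(sandbox_id)
        if row is None:
            return  # no-op when the id does not exist
        merged = {k: v for k, v in row.items() if k not in ("spec", "status")}
        merged.update(data)
        # spec is carried over untouched; status is replaced by the latest snapshot.
        self._rows[sandbox_id] = {**merged, "spec": row["spec"], "status": dict(merged)}

    def get(self, sandbox_id: str) -> Optional[dict]:
        return self._rows.get(sandbox_id)
```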

SandboxMetaStore
- create() gains spec: dict | None parameter; constructs SandboxRecordData
  before passing to SandboxTable; Redis path unchanged

SandboxManager
- start_async passes spec=docker_deployment_config.model_dump() to meta_store

Cleanup
- Remove SandboxInfoField generated Literal type and its generation script
- Replace SandboxInfoField with plain str in SandboxTable / SandboxMetaStore
  (LIST_BY_ALLOWLIST already enforces valid column names at runtime)
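
With the Literal type gone, the runtime allowlist check carries the validation. A minimal sketch of that check, assuming an illustrative subset of allowed columns (the PR's actual allowlist contents are not shown here):

```python
# Illustrative subset; the real LIST_BY_ALLOWLIST may differ.
LIST_BY_ALLOWLIST = frozenset({"user_id", "state", "namespace", "experiment_id"})


def validate_list_by_field(field: str) -> str:
    """Reject any column name that is not explicitly allowed."""
    if field not in LIST_BY_ALLOWLIST:
        raise ValueError(f"list_by does not support field {field!r}")
    return field


def list_by(rows: list, field: str, value: object) -> list:
    # Validation happens before the field name is used to build any query.
    validate_list_by_field(field)
    return [row for row in rows if row.get(field) == value]
```

Checking the name against a fixed set before it reaches a query is what lets the field be a plain str without opening up arbitrary column access.
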
… in SandboxTable

- Remove async_sessionmaker, _session_factory, and session() factory from DatabaseProvider
- Add engine property that raises RuntimeError if not initialised
- Update all SandboxTable methods to use AsyncSession(self._db.engine) directly
- Simpler, more explicit session lifecycle with no factory indirection
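
The guarded engine property could look roughly like this. The class name and the `init()` method are hypothetical, and a plain object stands in for the SQLAlchemy async engine so the sketch runs without third-party packages.

```python
from typing import Optional


class DatabaseProviderSketch:
    """Hypothetical reduction of DatabaseProvider showing only the engine guard."""

    def __init__(self, url: str) -> None:
        self._url = url
        self._engine: Optional[object] = None

    def init(self) -> None:
        # The PR presumably creates an async engine here; a placeholder
        # object is used so this sketch has no external dependencies.
        self._engine = object()

    @property
    def engine(self) -> object:
        if self._engine is None:
            raise RuntimeError("DatabaseProvider is not initialised; call init() first")
        return self._engine
```

Callers then open a session directly against `provider.engine`, and a missing initialisation fails loudly at first use instead of surfacing later as a None attribute error.
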
async def list_sandboxes(self, query_params: SandboxQueryParams) -> SandboxListResponse:
if self._redis_provider is None:
logger.warning("Redis provider is not available, list_sandboxes returning empty result")
async def list_sandboxes(
Collaborator

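Compatibility design for the get interface: if the sandbox is stopped, it should still return an error. Phase one stays consistent with the original logic; phase two adds a parameter check.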

Collaborator Author

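Before this PR, the get interface already raised an error for a stopped sandbox, so no compatibility handling is added for it yet. The list interface stays compatible with the original two states (PENDING/RUNNING) by default; adding use_legacy_states=false to the query params additionally returns sandboxes in the STOPPED state.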

logger.info(f"list sandboxes with filters: {query_params}, page: {page}, page_size: {page_size}")
try:
all_sandbox_data = await self.list_all_sandboxes_by_query_params(query_params)
all_sandbox_data = await self.list_all_sandboxes_by_query_params(query_params, use_legacy_states)
Collaborator

Don't add a new interface. This can be controlled through a parameter.

Collaborator Author

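Requests can control this with the use_legacy_states parameter: the default use_legacy_states=true keeps the original interface behavior, and use_legacy_states=false returns a sandbox list that also includes stopped sandboxes.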
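
The use_legacy_states behavior discussed in this thread can be sketched as a simple state filter. The function names and the in-memory list are hypothetical; only the parameter name, its default, and the three state values come from the discussion.

```python
LEGACY_STATES = ("pending", "running")
ALL_STATES = LEGACY_STATES + ("stopped",)


def states_for_listing(use_legacy_states: bool = True) -> tuple:
    """Default keeps the original two states; False also includes stopped."""
    return LEGACY_STATES if use_legacy_states else ALL_STATES


def list_sandboxes(sandboxes: list, use_legacy_states: bool = True) -> list:
    allowed = states_for_listing(use_legacy_states)
    return [sandbox for sandbox in sandboxes if sandbox["state"] in allowed]
```

Existing callers that never pass the parameter see the same PENDING/RUNNING results as before, which is why no new endpoint is needed.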

sandbox_table: SandboxTable | None = None,
) -> None:
self._redis: RedisProvider | None = redis_provider
self._db: SandboxTable | None = sandbox_table
Collaborator

The db always exists. Remove the check.

Later, refactor redis as well so that it always exists too.

Collaborator Author

When db_url is not specified, sqlite-memory mode is used; when the redis config is not specified, the fakeredis library is used.

- Fallback to sqlite-memory when database.url is not configured
- Fallback to FakeRedis when redis.host is not configured
- SandboxMetaStore now requires both providers (no more None checks)
- batch_get() returns only found sandboxes (no positional None slots)
- list_by() raises ValueError for non-allowlisted fields (no Redis fallback)
- list_sandboxes and batch_get_status expose use_legacy_states param
- Update tests to reflect new behaviour
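
The sqlite-memory fallback, combined with the URL conversion mentioned in the PR description, might be sketched as below. The function names are hypothetical, and the target async drivers (aiosqlite, asyncpg) are an assumption about what the PR converts to; leaving unknown schemes unchanged is also a choice made only for this sketch.

```python
from typing import Optional

SQLITE_MEMORY_URL = "sqlite:///:memory:"


def convert_url(url: str) -> str:
    """Rewrite a plain URL to an async-driver URL (assumed drivers)."""
    scheme, sep, rest = url.partition("://")
    if not sep:
        raise ValueError(f"not a database URL: {url!r}")
    if "+" in scheme:
        return url  # a driver is already specified; pass through unchanged
    if scheme == "sqlite":
        return f"sqlite+aiosqlite://{rest}"
    if scheme in ("postgresql", "postgres"):  # postgres:// is the Heroku shorthand
        return f"postgresql+asyncpg://{rest}"
    return url  # unknown schemes are left alone in this sketch


def resolve_database_url(configured: Optional[str]) -> str:
    """Fall back to in-memory SQLite when database.url is not configured."""
    return convert_url(configured or SQLITE_MEMORY_URL)
```

Resolving the fallback in one place is what allows SandboxMetaStore to require a provider unconditionally and drop its None checks.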

Development

Successfully merging this pull request may close these issues.

[Feature] Support persist sandbox info to databases

2 participants