feat(BA-4904): add GraphQL ResourceSlotType node with root queries and connections#9708
HyeockJinKim wants to merge 17 commits into main
Pull request overview
Adds new GraphQL surface area for resource slot metadata and per-entity slot usage/allocation, backed by new service actions and shared fetcher helpers.
Changes:
- Introduces ResourceSlotTypeGQL (+ NumberFormat) and root queries resource_slot_type / resource_slot_types.
- Adds Relay-style connection fields on AgentV2GQL (resource_slots) and KernelV2GQL (resource_allocations).
- Extends resource-slot service/processors/actions to support fetching slot-type registry entries.
Reviewed changes
Copilot reviewed 15 out of 15 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| src/ai/backend/manager/services/resource_slot/service.py | Adds service methods to fetch all slot types and a single slot type, mapping repository rows into data objects. |
| src/ai/backend/manager/services/resource_slot/processors.py | Registers new action processors for slot-type actions. |
| src/ai/backend/manager/services/resource_slot/actions/get_slot_type.py | Adds action/result for fetching a single slot type. |
| src/ai/backend/manager/services/resource_slot/actions/all_slot_types.py | Adds action/result for fetching all slot types. |
| src/ai/backend/manager/services/resource_slot/actions/__init__.py | Exposes new actions/results via package exports. |
| src/ai/backend/manager/data/resource_slot/types.py | Adds NumberFormatData and extends ResourceSlotTypeData with additional fields. |
| src/ai/backend/manager/api/gql/schema.py | Wires new root query resolvers into the GraphQL schema. |
| src/ai/backend/manager/api/gql/resource_slot/types.py | Introduces new GraphQL Node + Connection types for slot types, agent resources, and kernel allocations. |
| src/ai/backend/manager/api/gql/resource_slot/resolver.py | Adds root query resolvers for slot type queries. |
| src/ai/backend/manager/api/gql/resource_slot/fetcher.py | Adds shared fetcher utilities for root queries and connection fields. |
| src/ai/backend/manager/api/gql/resource_slot/__init__.py | Adds package marker. |
| src/ai/backend/manager/api/gql/kernel/types.py | Adds resource_allocations connection field on KernelV2GQL. |
| src/ai/backend/manager/api/gql/agent/types.py | Adds resource_slots connection field on AgentV2GQL. |
| src/ai/backend/common/data/permission/types.py | Adds new RBAC entity type RESOURCE_SLOT_TYPE. |
| changes/9708.feature.md | Adds changelog entry for the new GraphQL nodes/queries/fields. |

    @strawberry.field(description="Added in 26.4.0. Returns all registered resource slot types.")  # type: ignore[misc]
    async def resource_slot_types(
        info: Info[StrawberryGQLContext],

This root field returns a Relay Connection but does not accept any pagination arguments (e.g., first/after/last/before). As written, clients cannot paginate and page_info is always hardcoded in the fetcher; either expose proper Relay pagination parameters (and implement slicing + has_next_page/has_previous_page) or change the API to return a plain list instead of a Connection.

Suggested change:

    -    info: Info[StrawberryGQLContext],
    +    info: Info[StrawberryGQLContext],
    +    first: int | None = None,
    +    after: str | None = None,
    +    last: int | None = None,
    +    before: str | None = None,


    page_info = strawberry.relay.PageInfo(
        has_next_page=False,
        has_previous_page=False,
        start_cursor=edges[0].cursor if edges else None,
        end_cursor=edges[-1].cursor if edges else None,
    )

These connections are constructed with has_next_page/has_previous_page hardcoded to False, which produces misleading Relay semantics once the dataset grows. If the field is intended to be a real Relay connection, compute these flags based on the requested window (first/after/...) and the underlying total; otherwise consider returning a list type to avoid implying pagination support.
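
To illustrate what computing the flags from the requested window could look like, here is a minimal, framework-free sketch. `PageInfoSketch` and `paginate_forward` are hypothetical names, not part of this PR or of Strawberry, and the sketch only covers forward pagination (`first`/`after`) over an already-ordered list of cursors.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class PageInfoSketch:
    # Hypothetical stand-in for a Relay PageInfo object.
    has_next_page: bool
    has_previous_page: bool
    start_cursor: Optional[str]
    end_cursor: Optional[str]


def paginate_forward(
    cursors: Sequence[str],
    first: Optional[int] = None,
    after: Optional[str] = None,
) -> tuple[list[str], PageInfoSketch]:
    """Slice an ordered cursor list and derive the page flags from the window."""
    if first is not None and first < 0:
        raise ValueError("first must be non-negative")
    # Assumption: an unknown `after` cursor is treated as "start from the beginning".
    start = 0
    if after is not None and after in cursors:
        start = cursors.index(after) + 1
    end = len(cursors) if first is None else min(start + first, len(cursors))
    window = list(cursors[start:end])
    info = PageInfoSketch(
        has_next_page=end < len(cursors),  # items remain after the window
        has_previous_page=start > 0,  # items exist before the window
        start_cursor=window[0] if window else None,
        end_cursor=window[-1] if window else None,
    )
    return window, info
```

In a database-backed implementation the same flags are usually derived by fetching one extra row rather than loading the full list; the in-memory version here is only meant to show the semantics.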

    @classmethod
    async def resolve_nodes(  # type: ignore[override]
        cls,
        *,
        info: Info[StrawberryGQLContext],
        node_ids: Iterable[str],
        required: bool = False,
    ) -> Iterable[Self | None]:

The required flag is part of Strawberry's Node resolution contract, but it is currently ignored in all three resolve_nodes implementations in this file. When required=True and a node is missing, the resolver should raise (instead of returning None) to match the expected behavior for non-null node lookups.
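
A minimal sketch of the behavior described above, written without Strawberry so it can run on its own. `NodeNotFoundSketch`, `load_slot_type_sketch`, and the in-memory registry are hypothetical placeholders for the real error type and service call.

```python
import asyncio
from typing import Iterable, Optional


class NodeNotFoundSketch(Exception):
    """Hypothetical error raised when a required node is missing."""


# Hypothetical in-memory registry standing in for the real service call.
_REGISTRY_SKETCH = {"cpu": "count", "mem": "bytes"}


async def load_slot_type_sketch(slot_name: str) -> Optional[str]:
    return _REGISTRY_SKETCH.get(slot_name)


async def resolve_nodes_sketch(
    node_ids: Iterable[str],
    required: bool = False,
) -> list[Optional[str]]:
    results: list[Optional[str]] = []
    for node_id in node_ids:
        data = await load_slot_type_sketch(node_id)
        if data is None and required:
            # Non-null lookups must fail loudly instead of yielding None.
            raise NodeNotFoundSketch(f"resource slot type not found: {node_id}")
        results.append(data)
    return results
```

With `required=False` a missing ID yields `None` in its position; with `required=True` the first missing ID raises.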

    results: list[Self | None] = []
    for slot_name in node_ids:
        data = await load_resource_slot_type_data(info, slot_name)
        results.append(cls.from_data(data) if data is not None else None)
    return results

This performs one awaited fetch per node_id, causing an N+1 pattern for Relay node resolution. Prefer batching: fetch all requested slot_names in one service/repository call (or through a request-scoped DataLoader), then map results back to the original node_ids order.
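
The batching idea can be sketched as follows. `load_slot_types_batch_sketch` is a hypothetical stand-in for a single service or repository call that accepts many slot names; no such method is confirmed to exist in this PR.

```python
import asyncio
from typing import Iterable, Optional

# Hypothetical in-memory data standing in for the slot-type registry.
_SLOT_TYPES_SKETCH = {"cpu": "count", "mem": "bytes"}


async def load_slot_types_batch_sketch(slot_names: Iterable[str]) -> dict[str, str]:
    """One round trip for all requested names (assumed backend capability)."""
    wanted = set(slot_names)
    return {name: unit for name, unit in _SLOT_TYPES_SKETCH.items() if name in wanted}


async def resolve_nodes_batched_sketch(node_ids: Iterable[str]) -> list[Optional[str]]:
    ids = list(node_ids)  # materialize once; node_ids may be a one-shot iterable
    found = await load_slot_types_batch_sketch(ids)
    # Map results back to the original order, keeping None for missing IDs.
    return [found.get(node_id) for node_id in ids]
```

The result list preserves the order and length of `node_ids`, including duplicates, which is what Relay node resolution expects.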

        """Load raw AgentResourceData for a single agent+slot (used by Node.resolve_nodes)."""
        action_result = (
            await info.context.processors.resource_slot.get_agent_resources.wait_for_complete(
                GetAgentResourcesAction(agent_id=agent_id)
            )
        )
        for data in action_result.items:
            if data.slot_name == slot_name:
                return data
        return None


    async def load_kernel_allocation_data(
        info: Info[StrawberryGQLContext],
        kernel_id_str: str,
        slot_name: str,
    ) -> ResourceAllocationData | None:
        """Load raw ResourceAllocationData for a single kernel+slot (used by Node.resolve_nodes)."""
        action_result = (
            await info.context.processors.resource_slot.get_kernel_allocations.wait_for_complete(
                GetKernelAllocationsAction(kernel_id=_uuid.UUID(kernel_id_str))
            )
        )
        for data in action_result.items:
            if data.slot_name == slot_name:
                return data
        return None

For per-slot Node resolution you re-fetch the full agent resource list and linearly scan it for every requested slot. This is O(N*M) across multiple nodes and can produce repeated identical backend calls. Consider adding a dedicated service method/action to fetch a single slot (agent_id + slot_name), or batch: fetch once per agent_id and build a dict keyed by slot_name.

Suggested change:

        """Load raw AgentResourceData for a single agent+slot (used by Node.resolve_nodes).
        To avoid repeated backend calls and linear scans when resolving multiple slots for
        the same agent within a single request, this function caches the full list of
        resources per agent_id on the GraphQL context and indexes them by slot_name.
        """
        # Per-request cache: info.context._agent_resources_cache
        ctx = info.context
        cache = getattr(ctx, "_agent_resources_cache", None)
        if cache is None:
            cache = {}
            setattr(ctx, "_agent_resources_cache", cache)
        agent_cache = cache.get(agent_id)
        if agent_cache is None:
            action_result = (
                await ctx.processors.resource_slot.get_agent_resources.wait_for_complete(
                    GetAgentResourcesAction(agent_id=agent_id)
                )
            )
            # Index resources by slot_name for O(1) lookup.
            agent_cache = {item.slot_name: item for item in action_result.items}
            cache[agent_id] = agent_cache
        return agent_cache.get(slot_name)


    async def load_kernel_allocation_data(
        info: Info[StrawberryGQLContext],
        kernel_id_str: str,
        slot_name: str,
    ) -> ResourceAllocationData | None:
        """Load raw ResourceAllocationData for a single kernel+slot (used by Node.resolve_nodes).
        Similar to load_agent_resource_data(), this caches allocations per kernel_id within
        a single request to prevent repeated backend calls and linear scans when multiple
        slots are resolved for the same kernel.
        """
        # Per-request cache: info.context._kernel_allocations_cache
        ctx = info.context
        cache = getattr(ctx, "_kernel_allocations_cache", None)
        if cache is None:
            cache = {}
            setattr(ctx, "_kernel_allocations_cache", cache)
        kernel_cache = cache.get(kernel_id_str)
        if kernel_cache is None:
            action_result = (
                await ctx.processors.resource_slot.get_kernel_allocations.wait_for_complete(
                    GetKernelAllocationsAction(kernel_id=_uuid.UUID(kernel_id_str))
                )
            )
            # Index allocations by slot_name for O(1) lookup.
            kernel_cache = {item.slot_name: item for item in action_result.items}
            cache[kernel_id_str] = kernel_cache
        return kernel_cache.get(slot_name)


    items = [
        ResourceSlotTypeData(
            slot_name=row.slot_name,
            slot_type=row.slot_type,
            display_name=row.display_name,
            description=row.description,
            display_unit=row.display_unit,
            display_icon=row.display_icon,
            number_format=NumberFormatData(
                binary=row.number_format.binary,
                round_length=row.number_format.round_length,
            ),
            rank=row.rank,
        )

The row→ResourceSlotTypeData mapping logic is duplicated in both all_slot_types() and get_slot_type(). Extract a small private helper (e.g., _to_resource_slot_type_data(row)) to keep the mapping consistent and reduce the chance of future drift when columns are added/changed.

    async def load_agent_resource_data(
        info: Info[StrawberryGQLContext],
        agent_id: str,
        slot_name: str,
    ) -> AgentResourceData | None:
        """Load raw AgentResourceData for a single agent+slot (used by Node.resolve_nodes)."""
        action_result = (
            await info.context.processors.resource_slot.get_agent_resources.wait_for_complete(
                GetAgentResourcesAction(agent_id=agent_id)
            )
        )
        for data in action_result.items:
            if data.slot_name == slot_name:
                return data
        return None

This appears to need a separate query and service action that takes slot_name as input as well, rather than fetching all of the agent's resources.

    async def load_kernel_allocation_data(
        info: Info[StrawberryGQLContext],
        kernel_id_str: str,
        slot_name: str,
    ) -> ResourceAllocationData | None:
        """Load raw ResourceAllocationData for a single kernel+slot (used by Node.resolve_nodes)."""
        action_result = (
            await info.context.processors.resource_slot.get_kernel_allocations.wait_for_complete(
                GetKernelAllocationsAction(kernel_id=_uuid.UUID(kernel_id_str))
            )
        )
        for data in action_result.items:
            if data.slot_name == slot_name:
                return data
        return None

It seems slot_name should be passed in here as well, rather than filtering by slot_name afterwards.

    @strawberry.field(
        description="Added in 26.4.0. Returns a single resource slot type by slot_name, or null."

Please set the target version to 26.3.0 for all.

    async def all_slot_types(self, action: AllSlotTypesAction) -> AllSlotTypesResult:
        rows = await self._repository.all_slot_types()

Please expose the search functionality that was originally offered instead of an 'all' action. I don't want to provide an 'all' option.

    async def resource_allocations(
        self,
        info: Info[StrawberryGQLContext],
    ) -> Annotated[
        ResourceAllocationConnectionGQL,
        strawberry.lazy("ai.backend.manager.api.gql.resource_slot.types"),
    ]:
        """Fetch per-slot resource allocation for this kernel."""
        from ai.backend.manager.api.gql.resource_slot.fetcher import fetch_kernel_allocations

        return await fetch_kernel_allocations(info=info, kernel_id=str(self.id))

Please follow the existing pattern in which connection requests pass through all arguments, such as filter, order, before, and so on.

    async def resource_slots(
        self,
        info: Info[StrawberryGQLContext],
        first: int | None = None,
        after: str | None = None,
        last: int | None = None,
        before: str | None = None,
        limit: int | None = None,
        offset: int | None = None,
    ) -> Annotated[
        AgentResourceConnectionGQL,
        strawberry.lazy("ai.backend.manager.api.gql.resource_slot.types"),
    ]:

The filter and order are missing.

    async def resource_allocations(
        self,
        info: Info[StrawberryGQLContext],
        first: int | None = None,
        after: str | None = None,
        last: int | None = None,
        before: str | None = None,
        limit: int | None = None,
        offset: int | None = None,
    ) -> Annotated[

The filter and order are missing.

    async def fetch_agent_resources(
        info: Info[StrawberryGQLContext],
        agent_id: str,
        before: str | None = None,
        after: str | None = None,
        first: int | None = None,
        last: int | None = None,
        limit: int | None = None,
        offset: int | None = None,
    ) -> AgentResourceConnectionGQL:

The filter and order are missing. Instead of taking agent_id directly, please check the existing structure, which takes scope, filter, and order, and follow it.
…d connections
- Add ResourceSlotTypeGQL(Node) exposing all resource_slot_types columns (slot_name, slot_type, display_name, description, display_unit, display_icon, number_format, rank) with ResourceSlotTypeConnectionGQL
- Add AgentResourceSlotGQL(Node) for per-slot capacity/usage on agents with AgentResourceConnectionGQL; wire as resource_slots field on AgentV2GQL
- Add KernelResourceAllocationGQL(Node) for per-slot allocation on kernels with ResourceAllocationConnectionGQL; wire as resource_allocations field on KernelV2GQL
- Add root queries resource_slot_type(slot_name) and resource_slot_types()
- Shared fetcher functions reused across root queries and connection resolvers
- Add AllSlotTypesAction/GetSlotTypeAction to ResourceSlotService and processors
- Add NumberFormatData to data layer; add RESOURCE_SLOT_TYPE to EntityType enum
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…_data() Replace fetcher-returning-GQL-type pattern in resolve_nodes with data-returning helpers + cls.from_data() calls, following the established pattern in AgentV2GQL. This satisfies mypy's Iterable[Self | None] constraint. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… None Fetcher functions now propagate the exception so GraphQL returns error info to the user. resolve_nodes still catches it to comply with the relay spec (Iterable[Self | None]). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: octodog <mu001@lablup.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Eliminate duplicated ResourceSlotTypeData construction between all_slot_types() and get_resource_slot_type() methods. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: octodog <mu001@lablup.com>
… resource_slot_types
- Remove AllSlotTypesAction from service, processors, and actions/__init__.py
- Add ResourceSlotTypeFilterGQL, ResourceSlotTypeOrderFieldGQL, ResourceSlotTypeOrderByGQL to types.py
- Add CursorConditions.by_cursor_forward/backward to query.py for cursor pagination
- Update fetch_resource_slot_types fetcher to use build_querier() + SearchResourceSlotTypesAction with computed PageInfo (has_next_page/has_previous_page from actual results)
- Update resource_slot_types resolver to accept pagination args (first/after/last/before/limit/offset) and filter/order_by, following session GQL pattern
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: octodog <mu001@lablup.com>
…source connections
- Add AgentResourceQueryConditions/Orders and ResourceAllocationQueryConditions/Orders to query.py for cursor-based pagination on slot_name
- Update fetch_agent_resources and fetch_kernel_allocations to accept pagination args (first/after/last/before/limit/offset) and use SearchAgentResourcesAction / SearchResourceAllocationsAction via build_querier() with base_conditions scope filter
- Compute has_next_page/has_previous_page from actual search results instead of hardcoded False
- Cursor now encodes slot_name only (within fixed agent/kernel scope)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…inear scan
- Add AgentResourceNotFound and ResourceAllocationNotFound error types
- Add get_agent_resource_by_slot and get_kernel_allocation_by_slot DB source methods
- Add corresponding repository delegation methods
- Add GetAgentResourceBySlotAction and GetKernelAllocationBySlotAction
- Register new processors in ResourceSlotProcessors
- Update load_agent_resource_data, load_kernel_allocation_data, fetch_agent_resource_slot, fetch_kernel_resource_allocation to use slot-specific actions instead of full-list fetch + linear scan
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…eptions from fetchers
- ResourceSlotTypeGQL.resolve_nodes: re-raise ResourceSlotTypeNotFound when required=True
- AgentResourceSlotGQL.resolve_nodes: catch AgentResourceNotFound, raise if required=True
- KernelResourceAllocationGQL.resolve_nodes: catch ResourceAllocationNotFound, raise if required=True
- load_agent_resource_data / load_kernel_allocation_data: remove silent catch→None pattern, let domain exceptions propagate per CLAUDE.md error handling principle
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add pagination args (first/after/last/before/limit/offset) to AgentV2GQL.resource_slots and KernelV2GQL.resource_allocations field resolvers so clients can paginate these connections
- Delete unused AllSlotTypesAction service action file
- Remove dead fetch_agent_resource_slot and fetch_kernel_resource_allocation functions from fetcher.py
- Simplify load_resource_slot_type_data: return action_result.item directly instead of redundantly reconstructing ResourceSlotTypeData
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: octodog <mu001@lablup.com>
…cation
Conform nested connections to the existing scope/filter/order pattern:
- AgentResourceSlotFilterGQL: filter by slot_name
- AgentResourceSlotOrderByGQL: order by slot_name, capacity, used
- KernelResourceAllocationFilterGQL: filter by slot_name
- KernelResourceAllocationOrderByGQL: order by slot_name, requested, used
Update fetcher functions and nested connection resolvers in AgentV2GQL and KernelV2GQL to accept and pass filter/order_by parameters.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: octodog <mu001@lablup.com>
Summary
- Add ResourceSlotTypeGQL node exposing all resource_slot_types table columns (slot_name, slot_type, display_name, description, display_unit, display_icon, number_format, rank)
- Add root queries resource_slot_type(slot_name) and resource_slot_types(filter, order, pagination) with Connection support
- Add AgentResourceSlotGQL node and resource_slots connection field on AgentV2GQL
- Add KernelResourceAllocationGQL node and resource_allocations connection field on KernelV2GQL
- Shared fetcher functions in api/gql/resource_slot/fetcher.py reused across root queries and connection resolvers

Test plan
- ResourceSlotTypeGQL node exposes all resource_slot_types columns
- resource_slot_types returns Connection with filter/order/pagination
- resource_slot_type(slot_name) returns single node or null
- AgentV2GQL has resource_slots field returning AgentResourceSlotConnectionGQL
- KernelV2GQL has resource_allocations field returning KernelResourceAllocationConnectionGQL
- pants lint and pants check pass

Resolves BA-4904
📚 Documentation preview 📚: https://sorna--9708.org.readthedocs.build/en/9708/
📚 Documentation preview 📚: https://sorna-ko--9708.org.readthedocs.build/ko/9708/