feat(BA-4904): add GraphQL ResourceSlotType node with root queries and connections#9708

Open
HyeockJinKim wants to merge 17 commits into main from BA-4904

HyeockJinKim (Collaborator) commented Mar 5, 2026

Summary

  • Add ResourceSlotTypeGQL node exposing all resource_slot_types table columns (slot_name, slot_type, display_name, description, display_unit, display_icon, number_format, rank)
  • Add root queries resource_slot_type(slot_name) and resource_slot_types(filter, order, pagination) with Connection support
  • Add AgentResourceSlotGQL node and resource_slots connection field on AgentV2GQL
  • Add KernelResourceAllocationGQL node and resource_allocations connection field on KernelV2GQL
  • Shared fetcher pattern in api/gql/resource_slot/fetcher.py reused across root queries and connection resolvers

Test plan

  • ResourceSlotTypeGQL node exposes all resource_slot_types columns
  • Root query resource_slot_types returns Connection with filter/order/pagination
  • Root query resource_slot_type(slot_name) returns single node or null
  • AgentV2GQL has resource_slots field returning AgentResourceSlotConnectionGQL
  • KernelV2GQL has resource_allocations field returning KernelResourceAllocationConnectionGQL
  • Fetcher functions shared (no duplication) between root queries and connection resolvers
  • All types registered in GQL schema and queryable via introspection
  • pants lint and pants check pass

Resolves BA-4904


📚 Documentation preview 📚: https://sorna--9708.org.readthedocs.build/en/9708/


📚 Documentation preview 📚: https://sorna-ko--9708.org.readthedocs.build/ko/9708/

Copilot AI review requested due to automatic review settings March 5, 2026 15:54
github-actions bot added the size:XL (500~ LoC) label Mar 5, 2026
HyeockJinKim added a commit that referenced this pull request Mar 5, 2026
github-actions bot added the comp:manager (Related to Manager component) and comp:common (Related to Common component) labels Mar 5, 2026
Copilot AI (Contributor) left a comment

Pull request overview

Adds new GraphQL surface area for resource slot metadata and per-entity slot usage/allocation, backed by new service actions and shared fetcher helpers.

Changes:

  • Introduces ResourceSlotTypeGQL (+ NumberFormat) and root queries resource_slot_type / resource_slot_types.
  • Adds Relay-style connection fields on AgentV2GQL (resource_slots) and KernelV2GQL (resource_allocations).
  • Extends resource-slot service/processors/actions to support fetching slot-type registry entries.

Reviewed changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| src/ai/backend/manager/services/resource_slot/service.py | Adds service methods to fetch all slot types and a single slot type, mapping repository rows into data objects. |
| src/ai/backend/manager/services/resource_slot/processors.py | Registers new action processors for slot-type actions. |
| src/ai/backend/manager/services/resource_slot/actions/get_slot_type.py | Adds action/result for fetching a single slot type. |
| src/ai/backend/manager/services/resource_slot/actions/all_slot_types.py | Adds action/result for fetching all slot types. |
| src/ai/backend/manager/services/resource_slot/actions/__init__.py | Exposes new actions/results via package exports. |
| src/ai/backend/manager/data/resource_slot/types.py | Adds NumberFormatData and extends ResourceSlotTypeData with additional fields. |
| src/ai/backend/manager/api/gql/schema.py | Wires new root query resolvers into the GraphQL schema. |
| src/ai/backend/manager/api/gql/resource_slot/types.py | Introduces new GraphQL Node + Connection types for slot types, agent resources, and kernel allocations. |
| src/ai/backend/manager/api/gql/resource_slot/resolver.py | Adds root query resolvers for slot type queries. |
| src/ai/backend/manager/api/gql/resource_slot/fetcher.py | Adds shared fetcher utilities for root queries and connection fields. |
| src/ai/backend/manager/api/gql/resource_slot/__init__.py | Adds package marker. |
| src/ai/backend/manager/api/gql/kernel/types.py | Adds resource_allocations connection field on KernelV2GQL. |
| src/ai/backend/manager/api/gql/agent/types.py | Adds resource_slots connection field on AgentV2GQL. |
| src/ai/backend/common/data/permission/types.py | Adds new RBAC entity type RESOURCE_SLOT_TYPE. |
| changes/9708.feature.md | Adds changelog entry for the new GraphQL nodes/queries/fields. |


@strawberry.field(description="Added in 26.4.0. Returns all registered resource slot types.")  # type: ignore[misc]
async def resource_slot_types(
    info: Info[StrawberryGQLContext],
Copilot AI Mar 5, 2026

This root field returns a Relay Connection but does not accept any pagination arguments (e.g., first/after/last/before). As written, clients cannot paginate and page_info is always hardcoded in the fetcher; either expose proper Relay pagination parameters (and implement slicing + has_next_page/has_previous_page) or change the API to return a plain list instead of a Connection.

Suggested change
    info: Info[StrawberryGQLContext],
    first: int | None = None,
    after: str | None = None,
    last: int | None = None,
    before: str | None = None,

Comment on lines +58 to +63
page_info = strawberry.relay.PageInfo(
    has_next_page=False,
    has_previous_page=False,
    start_cursor=edges[0].cursor if edges else None,
    end_cursor=edges[-1].cursor if edges else None,
)
Copilot AI Mar 5, 2026

These connections are constructed with has_next_page/has_previous_page hardcoded to False, which produces misleading Relay semantics once the dataset grows. If the field is intended to be a real Relay connection, compute these flags based on the requested window (first/after/...) and the underlying total; otherwise consider returning a list type to avoid implying pagination support.
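
To make the point about computed flags concrete, here is a minimal sketch of forward pagination (first/after) over a sorted list of slot names. It is not the PR's implementation: `PageWindow` and `slice_forward` are hypothetical names, the cursor is assumed to be the raw slot name, and no Strawberry types are used so that the snippet stays self-contained.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class PageWindow:
    """Hypothetical container for one page; not a type from the PR or Strawberry."""

    items: list[str]
    has_next_page: bool
    has_previous_page: bool


def slice_forward(
    slot_names: Sequence[str],
    first: Optional[int] = None,
    after: Optional[str] = None,
) -> PageWindow:
    """Apply Relay-style forward pagination to slot names ordered alphabetically."""
    ordered = sorted(slot_names)
    # Skip every entry up to and including the `after` cursor (assumed to be a slot name).
    start = 0
    if after is not None:
        start = sum(1 for name in ordered if name <= after)
    remaining = ordered[start:]
    page = remaining if first is None else remaining[:first]
    return PageWindow(
        items=list(page),
        # More rows exist beyond this page only if the window was cut short.
        has_next_page=len(remaining) > len(page),
        # Rows exist before this page only if the cursor skipped something.
        has_previous_page=start > 0,
    )


first_page = slice_forward(["cpu", "mem", "cuda.device"], first=2)
second_page = slice_forward(["cpu", "mem", "cuda.device"], first=2, after="cuda.device")
```

Under these assumptions the first page holds two names with `has_next_page` set, and the second page holds the remaining name with `has_previous_page` set. A database-backed version would typically fetch `first + 1` rows instead of the whole list to detect a next page.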

Comment on lines +79 to +86
@classmethod
async def resolve_nodes(  # type: ignore[override]
    cls,
    *,
    info: Info[StrawberryGQLContext],
    node_ids: Iterable[str],
    required: bool = False,
) -> Iterable[Self | None]:
Copilot AI Mar 5, 2026

The required flag is part of Strawberry's Node resolution contract, but it is currently ignored in all three resolve_nodes implementations in this file. When required=True and a node is missing, the resolver should raise (instead of returning None) to match the expected behavior for non-null node lookups.
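
The behavior described in this comment can be sketched without any GraphQL machinery. The snippet below is only an illustration: `SlotTypeNotFound`, `_FAKE_REGISTRY`, and the standalone `resolve_nodes` function are hypothetical stand-ins, and the real resolver is an async classmethod that returns GQL node objects rather than strings.

```python
from typing import Iterable, Optional


class SlotTypeNotFound(LookupError):
    """Hypothetical error type used only in this sketch."""


# Stand-in for the real data source; contents are made up for illustration.
_FAKE_REGISTRY = {"cpu": "CPU", "mem": "Memory"}


def resolve_nodes(node_ids: Iterable[str], required: bool = False) -> list[Optional[str]]:
    """Return one entry per requested id, honoring the `required` flag."""
    results: list[Optional[str]] = []
    for node_id in node_ids:
        value = _FAKE_REGISTRY.get(node_id)
        if value is None and required:
            # Non-null lookups must fail loudly instead of yielding None.
            raise SlotTypeNotFound(node_id)
        results.append(value)
    return results
```

With `required=False` a missing id yields `None` in its position; with `required=True` the same lookup raises.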

Comment on lines +89 to +93
results: list[Self | None] = []
for slot_name in node_ids:
    data = await load_resource_slot_type_data(info, slot_name)
    results.append(cls.from_data(data) if data is not None else None)
return results
Copilot AI Mar 5, 2026

This performs one awaited fetch per node_id, causing an N+1 pattern for Relay node resolution. Prefer batching: fetch all requested slot_names in one service/repository call (or through a request-scoped DataLoader), then map results back to the original node_ids order.
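
A rough sketch of the batching idea follows, assuming a batched lookup exists. `fetch_slot_types_by_names` is a hypothetical function (the PR does not define it), and the in-memory table and call log exist only to make the single round trip observable.

```python
import asyncio
from typing import Optional, Sequence

# Stand-in for the slot-type table; contents are made up for this sketch.
_FAKE_TABLE = {"cpu": "CPU", "mem": "Memory"}

# Records each simulated backend round trip so the batching is observable.
backend_calls: list[frozenset[str]] = []


async def fetch_slot_types_by_names(slot_names: frozenset[str]) -> dict[str, str]:
    """Hypothetical batched lookup: one round trip for every requested name."""
    backend_calls.append(slot_names)
    return {name: _FAKE_TABLE[name] for name in slot_names if name in _FAKE_TABLE}


async def resolve_nodes_batched(node_ids: Sequence[str]) -> list[Optional[str]]:
    # De-duplicate, fetch once, then restore the caller's order (None for misses).
    found = await fetch_slot_types_by_names(frozenset(node_ids))
    return [found.get(node_id) for node_id in node_ids]


resolved = asyncio.run(resolve_nodes_batched(["mem", "gpu", "cpu", "mem"]))
```

The result keeps the order and duplicates of the requested ids while the backend is hit once. A request-scoped DataLoader would give the same effect across separate resolver calls.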

Comment on lines +202 to +228
    """Load raw AgentResourceData for a single agent+slot (used by Node.resolve_nodes)."""
    action_result = (
        await info.context.processors.resource_slot.get_agent_resources.wait_for_complete(
            GetAgentResourcesAction(agent_id=agent_id)
        )
    )
    for data in action_result.items:
        if data.slot_name == slot_name:
            return data
    return None


async def load_kernel_allocation_data(
    info: Info[StrawberryGQLContext],
    kernel_id_str: str,
    slot_name: str,
) -> ResourceAllocationData | None:
    """Load raw ResourceAllocationData for a single kernel+slot (used by Node.resolve_nodes)."""
    action_result = (
        await info.context.processors.resource_slot.get_kernel_allocations.wait_for_complete(
            GetKernelAllocationsAction(kernel_id=_uuid.UUID(kernel_id_str))
        )
    )
    for data in action_result.items:
        if data.slot_name == slot_name:
            return data
    return None
Copilot AI Mar 5, 2026

For per-slot Node resolution you re-fetch the full agent resource list and linearly scan it for every requested slot. This is O(N*M) across multiple nodes and can produce repeated identical backend calls. Consider adding a dedicated service method/action to fetch a single slot (agent_id + slot_name), or batch: fetch once per agent_id and build a dict keyed by slot_name.

Suggested change
    """Load raw AgentResourceData for a single agent+slot (used by Node.resolve_nodes).

    To avoid repeated backend calls and linear scans when resolving multiple slots for
    the same agent within a single request, this function caches the full list of
    resources per agent_id on the GraphQL context and indexes them by slot_name.
    """
    # Per-request cache: info.context._agent_resources_cache
    ctx = info.context
    cache = getattr(ctx, "_agent_resources_cache", None)
    if cache is None:
        cache = {}
        setattr(ctx, "_agent_resources_cache", cache)
    agent_cache = cache.get(agent_id)
    if agent_cache is None:
        action_result = (
            await ctx.processors.resource_slot.get_agent_resources.wait_for_complete(
                GetAgentResourcesAction(agent_id=agent_id)
            )
        )
        # Index resources by slot_name for O(1) lookup.
        agent_cache = {item.slot_name: item for item in action_result.items}
        cache[agent_id] = agent_cache
    return agent_cache.get(slot_name)


async def load_kernel_allocation_data(
    info: Info[StrawberryGQLContext],
    kernel_id_str: str,
    slot_name: str,
) -> ResourceAllocationData | None:
    """Load raw ResourceAllocationData for a single kernel+slot (used by Node.resolve_nodes).

    Similar to load_agent_resource_data(), this caches allocations per kernel_id within
    a single request to prevent repeated backend calls and linear scans when multiple
    slots are resolved for the same kernel.
    """
    # Per-request cache: info.context._kernel_allocations_cache
    ctx = info.context
    cache = getattr(ctx, "_kernel_allocations_cache", None)
    if cache is None:
        cache = {}
        setattr(ctx, "_kernel_allocations_cache", cache)
    kernel_cache = cache.get(kernel_id_str)
    if kernel_cache is None:
        action_result = (
            await ctx.processors.resource_slot.get_kernel_allocations.wait_for_complete(
                GetKernelAllocationsAction(kernel_id=_uuid.UUID(kernel_id_str))
            )
        )
        # Index allocations by slot_name for O(1) lookup.
        kernel_cache = {item.slot_name: item for item in action_result.items}
        cache[kernel_id_str] = kernel_cache
    return kernel_cache.get(slot_name)

Comment on lines +31 to +44
items = [
    ResourceSlotTypeData(
        slot_name=row.slot_name,
        slot_type=row.slot_type,
        display_name=row.display_name,
        description=row.description,
        display_unit=row.display_unit,
        display_icon=row.display_icon,
        number_format=NumberFormatData(
            binary=row.number_format.binary,
            round_length=row.number_format.round_length,
        ),
        rank=row.rank,
    )
Copilot AI Mar 5, 2026

The row→ResourceSlotTypeData mapping logic is duplicated in both all_slot_types() and get_slot_type(). Extract a small private helper (e.g., _to_resource_slot_type_data(row)) to keep the mapping consistent and reduce the chance of future drift when columns are added/changed.
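
One way the suggested helper could look is sketched below. The field names come from the snippet under review; the dataclass definitions and field types are assumptions made so the example runs on its own, and `SimpleNamespace` stands in for a repository row.

```python
from dataclasses import dataclass
from types import SimpleNamespace
from typing import Any


@dataclass(frozen=True)
class NumberFormatData:
    # Field types are assumed for this sketch.
    binary: bool
    round_length: int


@dataclass(frozen=True)
class ResourceSlotTypeData:
    # Field types are assumed for this sketch.
    slot_name: str
    slot_type: str
    display_name: str
    description: str
    display_unit: str
    display_icon: str
    number_format: NumberFormatData
    rank: int


def _to_resource_slot_type_data(row: Any) -> ResourceSlotTypeData:
    """Single place that maps a repository row to the data object."""
    return ResourceSlotTypeData(
        slot_name=row.slot_name,
        slot_type=row.slot_type,
        display_name=row.display_name,
        description=row.description,
        display_unit=row.display_unit,
        display_icon=row.display_icon,
        number_format=NumberFormatData(
            binary=row.number_format.binary,
            round_length=row.number_format.round_length,
        ),
        rank=row.rank,
    )


# Usage with a fake row; the values are invented for illustration.
example_row = SimpleNamespace(
    slot_name="cpu",
    slot_type="count",
    display_name="CPU",
    description="CPU cores",
    display_unit="Core",
    display_icon="cpu",
    number_format=SimpleNamespace(binary=False, round_length=0),
    rank=1,
)
example_data = _to_resource_slot_type_data(example_row)
```

Both service methods would then call the helper, so a new column only needs to be added in one place.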

github-actions bot added the area:docs (Documentations) label Mar 5, 2026
HyeockJinKim added a commit that referenced this pull request Mar 5, 2026
HyeockJinKim added and removed the skip:ci label (skips lint, check, and test in the action workflow; use with caution) Mar 5, 2026
Comment on lines +210 to +224
async def load_agent_resource_data(
    info: Info[StrawberryGQLContext],
    agent_id: str,
    slot_name: str,
) -> AgentResourceData | None:
    """Load raw AgentResourceData for a single agent+slot (used by Node.resolve_nodes)."""
    action_result = (
        await info.context.processors.resource_slot.get_agent_resources.wait_for_complete(
            GetAgentResourcesAction(agent_id=agent_id)
        )
    )
    for data in action_result.items:
        if data.slot_name == slot_name:
            return data
    return None
HyeockJinKim (Collaborator, Author)

It looks like we need a separate query and service action that also takes slot_name as input, instead of fetching all of the agent's resources.

Comment on lines +227 to +241
async def load_kernel_allocation_data(
    info: Info[StrawberryGQLContext],
    kernel_id_str: str,
    slot_name: str,
) -> ResourceAllocationData | None:
    """Load raw ResourceAllocationData for a single kernel+slot (used by Node.resolve_nodes)."""
    action_result = (
        await info.context.processors.resource_slot.get_kernel_allocations.wait_for_complete(
            GetKernelAllocationsAction(kernel_id=_uuid.UUID(kernel_id_str))
        )
    )
    for data in action_result.items:
        if data.slot_name == slot_name:
            return data
    return None
HyeockJinKim (Collaborator, Author)

Same here: this should take slot_name as an input to the query rather than filtering by slot_name afterwards.



@strawberry.field(
    description="Added in 26.4.0. Returns a single resource slot type by slot_name, or null."
HyeockJinKim (Collaborator, Author)

Please set the target version to 26.3.0 in all of these descriptions.

Comment on lines +32 to +33
async def all_slot_types(self, action: AllSlotTypesAction) -> AllSlotTypesResult:
    rows = await self._repository.all_slot_types()
HyeockJinKim (Collaborator, Author)

I want to expose the search functionality that was originally offered instead of an 'all' query. I don't want to provide an 'all' option at all.

Comment on lines +471 to +481
async def resource_allocations(
    self,
    info: Info[StrawberryGQLContext],
) -> Annotated[
    ResourceAllocationConnectionGQL,
    strawberry.lazy("ai.backend.manager.api.gql.resource_slot.types"),
]:
    """Fetch per-slot resource allocation for this kernel."""
    from ai.backend.manager.api.gql.resource_slot.fetcher import fetch_kernel_allocations

    return await fetch_kernel_allocations(info=info, kernel_id=str(self.id))
HyeockJinKim (Collaborator, Author)

Follow the existing pattern in which connection fields accept and pass through the full set of arguments (filter, order, before, and so on).

Comment on lines +449 to +461
async def resource_slots(
    self,
    info: Info[StrawberryGQLContext],
    first: int | None = None,
    after: str | None = None,
    last: int | None = None,
    before: str | None = None,
    limit: int | None = None,
    offset: int | None = None,
) -> Annotated[
    AgentResourceConnectionGQL,
    strawberry.lazy("ai.backend.manager.api.gql.resource_slot.types"),
]:
HyeockJinKim (Collaborator, Author)

The filter and order arguments are missing.

Comment on lines +471 to +480
async def resource_allocations(
    self,
    info: Info[StrawberryGQLContext],
    first: int | None = None,
    after: str | None = None,
    last: int | None = None,
    before: str | None = None,
    limit: int | None = None,
    offset: int | None = None,
) -> Annotated[
HyeockJinKim (Collaborator, Author)

The filter and order arguments are missing.

Comment on lines +164 to +173
async def fetch_agent_resources(
    info: Info[StrawberryGQLContext],
    agent_id: str,
    before: str | None = None,
    after: str | None = None,
    first: int | None = None,
    last: int | None = None,
    limit: int | None = None,
    offset: int | None = None,
) -> AgentResourceConnectionGQL:
HyeockJinKim (Collaborator, Author)

The filter and order arguments are missing. Instead of taking agent_id directly, check the existing structure that takes scope, filter, and order, and follow it here.

HyeockJinKim and others added 11 commits March 6, 2026 16:13
…d connections

- Add ResourceSlotTypeGQL(Node) exposing all resource_slot_types columns
  (slot_name, slot_type, display_name, description, display_unit, display_icon,
  number_format, rank) with ResourceSlotTypeConnectionGQL
- Add AgentResourceSlotGQL(Node) for per-slot capacity/usage on agents with
  AgentResourceConnectionGQL; wire as resource_slots field on AgentV2GQL
- Add KernelResourceAllocationGQL(Node) for per-slot allocation on kernels with
  ResourceAllocationConnectionGQL; wire as resource_allocations field on KernelV2GQL
- Add root queries resource_slot_type(slot_name) and resource_slot_types()
- Shared fetcher functions reused across root queries and connection resolvers
- Add AllSlotTypesAction/GetSlotTypeAction to ResourceSlotService and processors
- Add NumberFormatData to data layer; add RESOURCE_SLOT_TYPE to EntityType enum

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…_data()

Replace fetcher-returning-GQL-type pattern in resolve_nodes with
data-returning helpers + cls.from_data() calls, following the established
pattern in AgentV2GQL. This satisfies mypy's Iterable[Self | None] constraint.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… None

Fetcher functions now propagate the exception so GraphQL returns error
info to the user.  resolve_nodes still catches it to comply with the
relay spec (Iterable[Self | None]).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: octodog <mu001@lablup.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Eliminate duplicated ResourceSlotTypeData construction between
all_slot_types() and get_resource_slot_type() methods.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: octodog <mu001@lablup.com>
… resource_slot_types

- Remove AllSlotTypesAction from service, processors, and actions/__init__.py
- Add ResourceSlotTypeFilterGQL, ResourceSlotTypeOrderFieldGQL, ResourceSlotTypeOrderByGQL to types.py
- Add CursorConditions.by_cursor_forward/backward to query.py for cursor pagination
- Update fetch_resource_slot_types fetcher to use build_querier() + SearchResourceSlotTypesAction
  with computed PageInfo (has_next_page/has_previous_page from actual results)
- Update resource_slot_types resolver to accept pagination args (first/after/last/before/limit/offset)
  and filter/order_by, following session GQL pattern

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: octodog <mu001@lablup.com>
…source connections

- Add AgentResourceQueryConditions/Orders and ResourceAllocationQueryConditions/Orders
  to query.py for cursor-based pagination on slot_name
- Update fetch_agent_resources and fetch_kernel_allocations to accept pagination args
  (first/after/last/before/limit/offset) and use SearchAgentResourcesAction /
  SearchResourceAllocationsAction via build_querier() with base_conditions scope filter
- Compute has_next_page/has_previous_page from actual search results instead of hardcoded False
- Cursor now encodes slot_name only (within fixed agent/kernel scope)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
HyeockJinKim and others added 5 commits March 6, 2026 16:14
…inear scan

- Add AgentResourceNotFound and ResourceAllocationNotFound error types
- Add get_agent_resource_by_slot and get_kernel_allocation_by_slot DB source methods
- Add corresponding repository delegation methods
- Add GetAgentResourceBySlotAction and GetKernelAllocationBySlotAction
- Register new processors in ResourceSlotProcessors
- Update load_agent_resource_data, load_kernel_allocation_data,
  fetch_agent_resource_slot, fetch_kernel_resource_allocation to use
  slot-specific actions instead of full-list fetch + linear scan

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…eptions from fetchers

- ResourceSlotTypeGQL.resolve_nodes: re-raise ResourceSlotTypeNotFound when required=True
- AgentResourceSlotGQL.resolve_nodes: catch AgentResourceNotFound, raise if required=True
- KernelResourceAllocationGQL.resolve_nodes: catch ResourceAllocationNotFound, raise if required=True
- load_agent_resource_data / load_kernel_allocation_data: remove silent catch→None pattern,
  let domain exceptions propagate per CLAUDE.md error handling principle

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add pagination args (first/after/last/before/limit/offset) to
  AgentV2GQL.resource_slots and KernelV2GQL.resource_allocations field
  resolvers so clients can paginate these connections
- Delete unused AllSlotTypesAction service action file
- Remove dead fetch_agent_resource_slot and fetch_kernel_resource_allocation
  functions from fetcher.py
- Simplify load_resource_slot_type_data: return action_result.item directly
  instead of redundantly reconstructing ResourceSlotTypeData

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: octodog <mu001@lablup.com>
…cation

Conform nested connections to the existing scope/filter/order pattern:

- AgentResourceSlotFilterGQL: filter by slot_name
- AgentResourceSlotOrderByGQL: order by slot_name, capacity, used
- KernelResourceAllocationFilterGQL: filter by slot_name
- KernelResourceAllocationOrderByGQL: order by slot_name, requested, used

Update fetcher functions and nested connection resolvers in AgentV2GQL
and KernelV2GQL to accept and pass filter/order_by parameters.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: octodog <mu001@lablup.com>
Labels

area:docs (Documentations), comp:common (Related to Common component), comp:manager (Related to Manager component), size:XL (500~ LoC)

2 participants