Skip to content

feat(engine): Graph engine simulation#314

Open
ravisoundar wants to merge 1 commit into
mainfrom
rs-maestro-engine
Open

feat(engine): Graph engine simulation#314
ravisoundar wants to merge 1 commit into
mainfrom
rs-maestro-engine

Conversation

@ravisoundar
Copy link
Copy Markdown
Collaborator

Description

Maestro engine simulation.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.
  • All commits are signed off per DCO (git commit -s).

@ravisoundar ravisoundar requested a review from dmitsh as a code owner May 8, 2026 05:11
@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented May 8, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 8, 2026

Greptile Summary

This PR introduces a new graph engine (pkg/engines/graph) that serializes compute instance topology data into a JSON document rather than a cluster manager config format. It also consolidates simulation-model types (Node, NodeAttributes, GPU, etc.) from the local models package into the shared topology package, updates the Engine.GenerateOutput interface to carry computeInstances and environment parameters, and threads GetInstances through every sim provider.

  • New graph engine: resolves instances from an InstanceCache or via an InstanceProvider (from environment), validates completeness, sorts by ID, and writes JSON to stdout or a configured output path.
  • topology.Node / NodeAttributes: a single struct now serves both YAML simulation models and JSON API export, with custom MarshalJSON/UnmarshalJSON to bridge the flat-YAML shape (nvlink, status, timestamp) to the nested-JSON shape (attributes.gpu.*).
  • Provider changes: all sim providers now carry a *models.Model reference and implement GetInstances, which delegates to model.GetInstances with CloneForProvider for provider-scoped export.

Confidence Score: 5/5

Safe to merge; the graph engine logic, cache integration, and provider plumbing are all correct.

The core engine correctly guards the InstanceProvider call behind a len(missing) > 0 check, validates that every requested instance ID was returned, and propagates errors faithfully. The consolidation of Node types into the topology package is cleanly executed. Two minor JSON-contract observations (missing omitempty on GpuAttribute string fields, nil vs empty-slice for NetworkLayers) are style-level and do not break any current path.

pkg/topology/instances.go — JSON serialization behaviour for nodes with no GPU metadata or no switch associations.

Important Files Changed

Filename Overview
pkg/topology/instances.go New file introducing Node, NodeAttributes, Instances types and custom JSON marshal/unmarshal. GpuAttribute.Status and CollectedAt lack omitempty causing empty strings in output for nodes with no GPU data; NetworkLayers can serialize as null when NetLayers is nil in CloneForProvider.
pkg/engines/graph/engine.go New graph engine with cache-then-provider lookup, completeness validation, and optional file output. Core logic is correct; addressed previous concerns about missing-instance validation and unconditional provider assertion.
pkg/engines/graph/engine_test.go Comprehensive tests covering cache hits/misses, provider errors, missing-instance validation, output path, and sorting. Error path for missing provider is now correctly exercised with a non-empty cis.
pkg/models/model.go Local Node/NodeAttributes types replaced with topology.Node; field renames (Name→ID) propagated throughout. New GetInstances delegates to CloneForProvider for provider-scoped export.
pkg/engines/engines.go Engine interface updated; new InstanceCache and InstanceProvider interfaces added. Clean, well-documented additions.
pkg/server/engine.go GenerateOutput call updated to pass computeInstances and provider as environment; straightforward one-line change.
pkg/providers/test/test.go Adds model reference to Provider and implements GetInstances; returns nil when model is nil (no synthetic GetInstances support), leading to 'instance provider did not return data' errors when the graph engine is used without a model.

Sequence Diagram

sequenceDiagram
    participant Client
    participant Server as server.processTopologyRequest
    participant Provider as providers.Provider
    participant GraphEngine as graph.GraphEngine
    participant Cache as InstanceCache
    participant InstProv as engines.InstanceProvider

    Client->>Server: POST /v1/generate (graph engine)
    Server->>Provider: GetComputeInstances(ctx)
    Provider-->>Server: []ComputeInstances (instanceID→node map)
    Server->>Provider: GenerateTopologyConfig(ctx, computeInstances)
    Provider-->>Server: "*topology.Graph"
    Server->>GraphEngine: GenerateOutput(ctx, graph, computeInstances, params, provider)

    GraphEngine->>Cache: Get(ctx, cacheKey) for each instanceID
    alt cache hit
        Cache-->>GraphEngine: "*topology.Node"
    else cache miss
        Cache-->>GraphEngine: nil → added to missing[]
    end

    alt "len(missing) > 0"
        GraphEngine->>InstProv: GetInstances(ctx, missing)
        InstProv-->>GraphEngine: []topology.Node
        GraphEngine->>Cache: Set(ctx, key, node) for each result
        GraphEngine->>GraphEngine: verify all missing IDs returned
    end

    GraphEngine->>GraphEngine: sort instances by ID, marshal JSON
    alt outputPath set
        GraphEngine->>GraphEngine: files.Validate + files.Create
        GraphEngine-->>Server: OK
    else no outputPath
        GraphEngine-->>Server: JSON bytes
    end
    Server-->>Client: response
Loading

Reviews (7): Last reviewed commit: "feat(engine): Graph Engine simulator" | Re-trigger Greptile

Comment thread pkg/engines/maestro/engine.go Outdated
Comment thread pkg/engines/maestro/engine.go Outdated
Comment thread pkg/engines/maestro/engine.go Outdated
@ravisoundar ravisoundar force-pushed the rs-maestro-engine branch from 64556d5 to 3979991 Compare May 8, 2026 15:34
Comment thread pkg/engines/maestro/engine.go Outdated
@ravisoundar ravisoundar force-pushed the rs-maestro-engine branch from 3979991 to 7029cf2 Compare May 8, 2026 15:54
Comment thread pkg/engines/graph/engine_test.go
@ravisoundar ravisoundar force-pushed the rs-maestro-engine branch 2 times, most recently from 6728214 to 16ee88a Compare May 8, 2026 16:30
@ravisoundar ravisoundar changed the title feat(engine): Maestro engine simulation feat(engine): Graph engine simulation May 8, 2026
Comment thread pkg/engines/graph/engine.go Outdated
Comment thread pkg/engines/graph/engine.go
Signed-off-by: Ravi Shankar <ravish@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant