fix: include pod clique pod index in pod name#643

Closed
AsadShahid04 wants to merge 2 commits into ai-dynamo:main from AsadShahid04:fix/pod-name-clique-index-635

Conversation

@AsadShahid04

Summary

  • Changed GenerateName in buildResource to embed the pod index in the pod name prefix (e.g. ubuntu-0-worker-0-<suffix> instead of ubuntu-0-worker-<suffix>)
  • Added unit tests: TestGetLabels_PodIndexLabel and TestPodGenerateNameIncludesPodIndex

Problem

When a pod fails, the hostname in logs (e.g. ubuntu-0-worker) identifies the PodClique but not which replica failed. To map pod names to their index, users must write custom kubectl get pods -o custom-columns queries, because standard output options such as -o wide do not show the index.

Solution

Include the pod's clique pod index in the GenerateName prefix so that Kubernetes-assigned names encode the index directly:

Before: ubuntu-0-worker-2tnab, ubuntu-0-worker-5mfde
After:  ubuntu-0-worker-0-2tnab, ubuntu-0-worker-1-5mfde

The pod index is already tracked as a label (grove.io/podclique-pod-index) and used for hostname and env var injection. This change surfaces it in the pod name itself, consistent with how PCS and PCSG replica indices are already embedded in resource names.
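
The naming scheme can be sketched as follows. This is a minimal illustration, not the code in buildResource: the helper name podGenerateNamePrefix and its signature are assumptions made for this example. It only shows how the prefix handed to GenerateName changes, with the random suffix still appended by Kubernetes.

```go
package main

import "fmt"

// podGenerateNamePrefix is a hypothetical helper (not taken from the PR).
// It builds the GenerateName prefix "<pclqName>-<podIndex>-", to which
// Kubernetes then appends its random suffix.
func podGenerateNamePrefix(pclqName string, podIndex int) string {
	return fmt.Sprintf("%s-%d-", pclqName, podIndex)
}

func main() {
	// Before the change the prefix would have been "ubuntu-0-worker-".
	fmt.Println(podGenerateNamePrefix("ubuntu-0-worker", 0)) // prints "ubuntu-0-worker-0-"
	fmt.Println(podGenerateNamePrefix("ubuntu-0-worker", 1)) // prints "ubuntu-0-worker-1-"
}
```

Note that the index adds characters to every pod name, so long PodClique names leave less room under the Kubernetes name length limit.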

Testing

  • go build ./...: passed
  • go test ./internal/controller/podclique/components/pod/...: passed
  • New tests: TestGetLabels_PodIndexLabel, TestPodGenerateNameIncludesPodIndex

Closes #635

Pod names are generated using GenerateName, which appends a random
suffix. Before this change, two pods for the same PodClique were
named e.g. ubuntu-0-worker-2tnab and ubuntu-0-worker-5mfde,
making it impossible to correlate the hostname seen in logs with a
specific pod without querying custom columns.

Include the per-pod index (LabelPodCliquePodIndex) directly in the
GenerateName prefix so that names become ubuntu-0-worker-0-2tnab
and ubuntu-0-worker-1-5mfde. This mirrors the existing convention
used for PCS and PCSG replica indices and requires no label lookup
to identify which pod is which.

Closes ai-dynamo#635

Signed-off-by: OpenClaw Agent <agent@openclaw.local>
@copy-pr-bot

copy-pr-bot Bot commented Jun 2, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Pod names now have the format <pclqName>-<podIndex>-<k8sRandom> after
the previous commit added the pod clique index to GenerateName. The
helper function extractPCLQNameFromPodName only stripped one trailing
segment, returning <pclqName>-<podIndex> instead of <pclqName>, which
broke the PodGang→PodClique reconcile mapping and left all pods
permanently ScheduleGated.

Strip two segments (random suffix then pod index) to correctly recover
the PodClique FQN.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: OpenClaw Agent <agent@openclaw.local>
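
The two-segment strip described in this commit can be sketched as below. The function names stripTrailingSegments and extractPCLQName are hypothetical stand-ins for the real extractPCLQNameFromPodName helper, whose actual implementation may differ; the sketch assumes the pod name has the form <pclqName>-<podIndex>-<k8sRandom>.

```go
package main

import (
	"fmt"
	"strings"
)

// stripTrailingSegments removes the last n dash-separated segments from name.
// It returns "" when name contains fewer than n dashes.
func stripTrailingSegments(name string, n int) string {
	for i := 0; i < n; i++ {
		idx := strings.LastIndex(name, "-")
		if idx < 0 {
			return ""
		}
		name = name[:idx]
	}
	return name
}

// extractPCLQName is a hypothetical stand-in for extractPCLQNameFromPodName.
// It drops the random suffix and then the pod index.
func extractPCLQName(podName string) string {
	return stripTrailingSegments(podName, 2)
}

func main() {
	fmt.Println(extractPCLQName("ubuntu-0-worker-0-2tnab")) // prints "ubuntu-0-worker"
}
```

One consequence of a purely positional strip, at least in this sketch: a pod created under the old format (ubuntu-0-worker-2tnab) would be parsed as ubuntu-0, so pods that predate the change would need separate handling.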
Development

Successfully merging this pull request may close these issues.

Add pod clique pod index to pod name for better visibility