Harness preview: customer-env 'ctr run' fails (exit 1) for any environmentArtifact.containerConfiguration.containerUri, including stock public images from the docs #931

@aryankhanna2004

Summary

On the current preview release (@aws/agentcore@preview, CLI v1.0.0-preview.1), any harness that has a non-null environmentArtifact.containerConfiguration.containerUri fails at invoke time with:

runtimeClientError: Command '['/usr/local/bin/ctr', '-a', '/run/containerd/containerd.sock',
'run', '-d', '--net-host',
'--mount=type=bind,src=/mnt/data,dst=/mnt/data,options=rbind:rw',
'<containerUri>', 'customer-env', '/bin/sh', '-c', 'sleep infinity']'
returned non-zero exit status 1.

Harnesses that do not set environmentArtifact (i.e. use the default image) work fine in the same project, same region, same execution role template, same session format.

This looks like a service-side bug in the AgentCore Harness runtime's customer-env spawn path, not in the CLI. I'm filing it here per the preview bug-report channel in the README; feel free to transfer it to the right internal repo.

Reproducer (no custom image, no custom CLI changes)

Minimal harness config. The official docs at harness-environment.html suggest a stock public image ("Or reference a pre-built image: public.ecr.aws/docker/library/node:slim"). The repro below uses python:3.12-slim-bookworm from the same public registry; node:slim reproduces identically:

{
  "name": "probe",
  "model": {
    "provider": "bedrock",
    "modelId": "us.anthropic.claude-opus-4-5-20251101-v1:0"
  },
  "memory": { "name": "someMemory" },
  "containerUri": "public.ecr.aws/docker/library/python:3.12-slim-bookworm",
  "sessionStoragePath": "/mnt/data",
  "maxIterations": 10,
  "timeoutSeconds": 300,
  "authorizerType": "AWS_IAM"
}

Deploy and invoke:

agentcore deploy --yes
agentcore invoke --harness probe --session-id probe-diag-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx --user-id me 'PROBE OK'

Result

Error: Command '['/usr/local/bin/ctr', '-a', '/run/containerd/containerd.sock', 'run', '-d',
'--net-host', '--mount=type=bind,src=/mnt/data,dst=/mnt/data,options=rbind:rw',
'public.ecr.aws/docker/library/python:3.12-slim-bookworm', 'customer-env',
'/bin/sh', '-c', 'sleep infinity']' returned non-zero exit status 1.
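
For anyone who wants to replay the failing command by hand on a host with containerd, the argv can be recovered from the error text itself. The message has the shape of Python's `CalledProcessError` string, which embeds the `repr` of the argument list. The sketch below assumes that shape; `extract_ctr_argv` is a hypothetical helper name, not part of the CLI or the service.

```python
import ast
import re
import shlex


def extract_ctr_argv(error_text):
    """Recover the argv list embedded in a CalledProcessError-style message."""
    # The list repr sits between "Command '" and "' returned non-zero exit status".
    match = re.search(
        r"Command '(\[.*\])' returned non-zero exit status",
        error_text,
        re.DOTALL,  # the log wraps the list across several lines
    )
    if match is None:
        raise ValueError("no embedded command list found")
    # literal_eval only parses literals, so nothing in the log text is executed.
    argv = ast.literal_eval(match.group(1))
    if not isinstance(argv, list) or not all(isinstance(a, str) for a in argv):
        raise ValueError("embedded command is not a list of strings")
    return argv


if __name__ == "__main__":
    message = (
        "Error: Command '['/usr/local/bin/ctr', '-a', '/run/containerd/containerd.sock', 'run', '-d',\n"
        "'--net-host', '--mount=type=bind,src=/mnt/data,dst=/mnt/data,options=rbind:rw',\n"
        "'public.ecr.aws/docker/library/python:3.12-slim-bookworm', 'customer-env',\n"
        "'/bin/sh', '-c', 'sleep infinity']' returned non-zero exit status 1."
    )
    # Print a shell-quoted one-liner that can be pasted into a terminal.
    print(shlex.join(extract_ctr_argv(message)))
```

Running the recovered command manually should surface the stderr that the invoke path currently hides.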

Control (works)

Identical harness config with the containerUri field removed — the same invoke call succeeds and the agent replies normally. GetHarness shows environmentArtifact: null on the working one and environmentArtifact.containerConfiguration.containerUri: "public.ecr.aws/i0n3d3i5/harness-us-east-1:latest" (the managed harness runtime image) on the working one — i.e., environmentArtifact is what changes behavior, and nothing below the service layer.

Things I ruled out

  • Our image / our Dockerfile: reproduces with the stock public Python image from AWS's own docs. Reproduces with public.ecr.aws/docker/library/node:slim as well.
  • Architecture mismatch: both images are multi-arch manifests that include linux/arm64; the microVM host is arm64 (confirmed via uname -m on a working default-image harness → aarch64).
  • ECR pull permissions: same error with public ECR (no creds needed) and with private ECR after attaching ecr:BatchCheckLayerAvailability / ecr:GetDownloadUrlForLayer / ecr:BatchGetImage to the harness execution role. The AgentCore runtime logs confirm Pulled customer image: ... succeeds before the ctr run call fails.
  • Missing mount destination: adding RUN mkdir -p /mnt/data to a custom image changes nothing. Stock public images that have no /mnt/data baked in also fail, and as far as I know the OCI runtime creates a missing bind-mount destination on its own (rbind only makes the bind recursive).
  • CLI: the same harness created directly via bedrock-agentcore-control CreateHarness with the same JSON reproduces; so does one created by @aws/agentcore@preview with either #929 or #930 applied.
  • Session id format / length: other harnesses in the same project work with the same session-id generator (33+ chars).
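
The architecture check above can be repeated offline against the image's manifest list. This is a minimal sketch, assuming the JSON follows the OCI image index layout (a `manifests` array whose entries carry a `platform` object with `os` and `architecture`). `index_supports` is a hypothetical helper, and the digests in the sample are placeholders, not real values for either image.

```python
def index_supports(index, os_name="linux", arch="arm64"):
    """Return True if an OCI image index lists the given os/architecture pair."""
    for entry in index.get("manifests", []):
        # Some entries (e.g. attestations) may lack a usable platform object.
        platform = entry.get("platform") or {}
        if platform.get("os") == os_name and platform.get("architecture") == arch:
            return True
    return False


if __name__ == "__main__":
    # Placeholder index; in practice this JSON could come from a tool such as
    # `docker manifest inspect <image>`.
    sample_index = {
        "schemaVersion": 2,
        "manifests": [
            {"digest": "sha256:<placeholder-amd64>",
             "platform": {"os": "linux", "architecture": "amd64"}},
            {"digest": "sha256:<placeholder-arm64>",
             "platform": {"os": "linux", "architecture": "arm64", "variant": "v8"}},
        ],
    }
    print(index_supports(sample_index))
```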

What probably needs to happen service-side

ctr run exiting with status 1 is almost always one of: image fails to mount root fs, OCI config/user/capabilities rejected, container name in use, or snapshotter error. Any of them writes a specific message to stderr. That stderr is currently being swallowed by the harness runtime's error wrapper — the caller only ever sees non-zero exit status 1, with no detail. Fixing that alone would unblock customer self-diagnosis of every bug in this area.

Two asks

  1. Investigate why the customer-env ctr run fails for all non-default containerConfiguration.containerUri values on the current preview.
  2. In the harness runtime's subprocess.run(...) wrapper around ctr, capture and re-raise (or log to the customer's log stream) ctr's stdout+stderr when it exits non-zero, so future bugs in this area aren't opaque.
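
To make the second ask concrete, here is a minimal sketch of what such a wrapper could look like. The real harness runtime code is not visible from outside, so everything here is an assumption: `ContainerStartError` and `run_checked` are hypothetical names, and the demo uses a stand-in child process instead of `ctr`.

```python
import subprocess
import sys


class ContainerStartError(RuntimeError):
    """Hypothetical error type that carries the child's exit code and output."""

    def __init__(self, argv, returncode, stdout, stderr):
        self.argv = list(argv)
        self.returncode = returncode
        self.stdout = stdout
        self.stderr = stderr
        super().__init__(
            f"{self.argv[0]} exited with status {returncode}\n"
            f"stdout: {stdout.strip() or '<empty>'}\n"
            f"stderr: {stderr.strip() or '<empty>'}"
        )


def run_checked(argv, timeout=60.0):
    """Run argv; on a non-zero exit, raise with stdout and stderr attached."""
    # check=False so we build our own error instead of a bare CalledProcessError,
    # whose default message omits the captured output.
    result = subprocess.run(
        argv, capture_output=True, text=True, timeout=timeout, check=False
    )
    if result.returncode != 0:
        raise ContainerStartError(argv, result.returncode, result.stdout, result.stderr)
    return result.stdout


if __name__ == "__main__":
    # Stand-in for a failing `ctr run`: a child that writes to stderr and exits 1.
    failing = [
        sys.executable, "-c",
        "import sys; sys.stderr.write('snapshotter error'); sys.exit(1)",
    ]
    try:
        run_checked(failing)
    except ContainerStartError as exc:
        print(exc)
```

With something along these lines, the message that reaches the caller (or the customer's log stream) would include whatever ctr printed, rather than only the exit status.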

Evidence

Invoke log from the probe harness (full log retained locally):

[16:13:22.533] INVOKE REQUEST (Session: probe-diag-20260422-161500-aryan-qrstuvwx12)
  runtimeArn: arn:aws:bedrock-agentcore:us-east-1:216989103356:harness/cic101pptagent_probe-JMU2AFlACj
  prompt: "Just reply with the text: PROBE OK"

[16:13:26.182] ERROR CONTEXT: stream error
[16:13:26.182] ERROR: runtimeClientError: Command '['/usr/local/bin/ctr', '-a',
  '/run/containerd/containerd.sock', 'run', '-d', '--net-host',
  '--mount=type=bind,src=/mnt/data,dst=/mnt/data,options=rbind:rw',
  'public.ecr.aws/docker/library/python:3.12-slim-bookworm', 'customer-env',
  '/bin/sh', '-c', 'sleep infinity']' returned non-zero exit status 1.

GetHarness on the broken harness (trimmed):

{
  "harnessName": "cic101pptagent_probe",
  "status": "READY",
  "environmentArtifact": {
    "containerConfiguration": {
      "containerUri": "public.ecr.aws/docker/library/python:3.12-slim-bookworm"
    }
  },
  "environment": {
    "agentCoreRuntimeEnvironment": {
      "agentRuntimeArn": "arn:aws:bedrock-agentcore:us-east-1:216989103356:runtime/harness_cic101pptagent_probe-HQxcVa26D8",
      "networkConfiguration": { "networkMode": "PUBLIC" },
      "filesystemConfigurations": [{ "sessionStorage": { "mountPath": "/mnt/data" } }]
    }
  }
}
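
When comparing several harnesses, the one field that separates the broken case from the working one can be pulled out mechanically. This is a small sketch over the trimmed response shape shown above; `custom_container_uri` is a hypothetical helper, and it assumes the response has already been loaded into a dict.

```python
from typing import Optional


def custom_container_uri(harness: dict) -> Optional[str]:
    """Return environmentArtifact.containerConfiguration.containerUri, or None."""
    # environmentArtifact is null on harnesses that use the default image,
    # so each level falls back to an empty dict before the next lookup.
    artifact = harness.get("environmentArtifact") or {}
    config = artifact.get("containerConfiguration") or {}
    return config.get("containerUri")


if __name__ == "__main__":
    broken = {
        "harnessName": "cic101pptagent_probe",
        "status": "READY",
        "environmentArtifact": {
            "containerConfiguration": {
                "containerUri": "public.ecr.aws/docker/library/python:3.12-slim-bookworm"
            }
        },
    }
    control = {"status": "READY", "environmentArtifact": None}
    for label, harness in (("broken", broken), ("control", control)):
        print(label, custom_container_uri(harness))
```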

Control (working) harness — identical config minus containerUri:

  • GetHarness → environmentArtifact: null
  • Same invoke prompt → returns PLAIN OK
  • agentcore invoke --exec 'python3 --version && uname -m' → Python 3.10.19, aarch64

Environment

  • CLI: @aws/agentcore@preview @ v1.0.0-preview.1 (also repro'd on a local build of main)
  • Region: us-east-1
  • Account: 216989103356
  • AWS CLI v2 / node v20 / macOS arm64 host

Related

Neither PR changes the behavior reported here — both successfully build/push an image and create the harness; invoke still hits this service-side failure.
