**plugins/sagemaker-ai/README.md**

This plugin brings deep AWS AI/ML expertise directly into your coding assistant, covering the surface area of [Amazon SageMaker AI](https://aws.amazon.com/sagemaker/ai/); currently, skills are provided to assist with the following capability areas:

- **Model Customization** — End-to-end guided workflows for fine-tuning foundation models, from use case definition through data preparation, training, evaluation, and deployment on Amazon SageMaker AI.
- **AI Optimization** — Benchmarking LLM inference performance and getting deployment recommendations for the best instance type, serving configuration, and optimizations.
- **HyperPod Cluster Operations** — Remote command execution on nodes via SSM, version checking, and diagnostic reporting for SageMaker HyperPod training clusters.

## Agent Skills
...
| 10 | `hyperpod-ssm` | Remote command execution and file transfer on HyperPod cluster nodes via SSM | [SKILL.md](skills/hyperpod-ssm/SKILL.md) |
| 11 | `hyperpod-version-checker` | Check and compare software component versions across HyperPod cluster nodes | [SKILL.md](skills/hyperpod-version-checker/SKILL.md) |
| 12 | `hyperpod-issue-report` | Generate diagnostic reports for HyperPod troubleshooting and support cases | [SKILL.md](skills/hyperpod-issue-report/SKILL.md) |
| 13 | `ai-optimization` | Guided workflows for benchmarking LLM inference and getting deployment recommendations (best instance, optimizations) | [SKILL.md](skills/ai-optimization/SKILL.md) |

## MCP Servers

...

The skills in this plugin encode AWS best practices, but they are fully customizable. You can fork the repository and modify any `SKILL.md` to reflect your organization's standards, approved techniques, required evaluation benchmarks, or internal tooling. Workspace-level skills take precedence over global skills, so teams can maintain their own versions without affecting other users.

## AI Optimization

The AI Optimization skill covers the SageMaker AI Optimization APIs for benchmarking and optimizing LLM inference performance. It guides users through:

- **Workload configuration** — Define traffic patterns (request shape, concurrency, dataset) for benchmarking
- **Benchmark jobs** — Measure inference performance (latency, throughput, cost) on existing SageMaker endpoints
- **Recommendation jobs** — Automatically find the best instance type, serving configuration, and optimizations (kernel tuning, speculative decoding) for a model

### How It Works

- **Define your workload** — Describe your expected traffic pattern (input/output token lengths, concurrency). The skill creates an AIWorkloadConfig.
- **Benchmark or recommend** — Either benchmark an existing endpoint, or provide a model S3 URI and let the service evaluate multiple instance types automatically.
- **Review results** — The skill presents ranked recommendations with expected performance metrics, optimization details, and deployable ModelPackages.
- **Deploy** — Each recommendation includes a ModelPackage that can be deployed directly to a SageMaker endpoint in a single step (see the sketch below).
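
For illustration, a minimal deployment sketch using the standard SageMaker `CreateModel`/`CreateEndpointConfig`/`CreateEndpoint` APIs; the ModelPackage ARN, resource names, and instance type below are placeholders, and the real ARN comes from the recommendation results:

```python
import boto3

sm = boto3.client("sagemaker")

# Placeholder ARN; substitute the ModelPackage ARN from the recommendation results
mp_arn = "arn:aws:sagemaker:us-east-1:123456789012:model-package/example-recommendation"

sm.create_model(
    ModelName="recommended-model",
    PrimaryContainer={"ModelPackageName": mp_arn},
    ExecutionRoleArn="<ROLE_ARN>",
)
sm.create_endpoint_config(
    EndpointConfigName="recommended-endpoint-config",
    ProductionVariants=[{
        "VariantName": "AllTraffic",
        "ModelName": "recommended-model",
        "InstanceType": "ml.g5.2xlarge",  # Placeholder; use the recommended instance type
        "InitialInstanceCount": 1,
    }],
)
sm.create_endpoint(
    EndpointName="recommended-endpoint",
    EndpointConfigName="recommended-endpoint-config",
)
```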

### Examples

- "Benchmark my SageMaker endpoint"
- "Find the best instance type for my Llama model"
- "Optimize inference cost for my model in S3"
- "Create a recommendation job for throughput optimization"
- "What's the cheapest way to serve my model?"

## Related Resources

- [Amazon SageMaker AI Model Customization](https://aws.amazon.com/sagemaker/ai/model-customization/)
**plugins/sagemaker-ai/skills/ai-optimization/SKILL.md**

---
name: ai-optimization
description: Guides users through SageMaker AI Optimization APIs for benchmarking and optimizing LLM inference. Covers workload configuration, benchmark jobs, and recommendation jobs that find the best instance type, optimization strategy, and serving configuration for a model. Use when the user says "benchmark my model", "optimize inference", "find the best instance", "recommendation job", "workload config", "AI benchmark", "AI recommendation", "reduce inference cost", "improve latency", or "optimize throughput".
metadata:
  version: "1.0.0"
---

# AI Optimization

Guide users through SageMaker AI Optimization APIs to benchmark LLM inference performance and get deployment recommendations.

## Scope

This skill covers the **SageMaker AI Optimization** APIs, which help users:

- **Benchmark** an existing SageMaker endpoint to measure inference performance (latency, throughput, cost)
- **Get recommendations** for the best instance type, serving configuration, and optional optimizations (kernel tuning, speculative decoding) for deploying a model

### Three Resource Types

| Resource | Purpose |
| ----------------------- | -------------------------------------------------------------------------------------------------------- |
| **AIWorkloadConfig** | Defines the traffic pattern (request shape, concurrency, dataset) for benchmarking |
| **AIBenchmarkJob** | Runs a benchmark against a live SageMaker endpoint using a workload config |
| **AIRecommendationJob** | Analyzes a model, deploys it on candidate instances, benchmarks each, and returns ranked recommendations |

### 14 API Operations

| Resource | Create | Describe | Delete | List | Stop |
| ------------------- | ------ | -------- | ------ | ---- | ---- |
| AIWorkloadConfig | ✓ | ✓ | ✓ | ✓ | |
| AIBenchmarkJob | ✓ | ✓ | ✓ | ✓ | ✓ |
| AIRecommendationJob | ✓ | ✓ | ✓ | ✓ | ✓ |
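
As a rough sketch of how these operations might surface in boto3, the method names below are assumptions extrapolated from the create/describe calls in this skill's reference files, following boto3's snake_case convention:

```python
import boto3

sm = boto3.client("sagemaker")

# Assumed method names, mirroring the operations table above
sm.list_ai_workload_configs()
sm.describe_ai_workload_config(AIWorkloadConfigName="my-workload-config")
sm.delete_ai_workload_config(AIWorkloadConfigName="my-workload-config")
sm.stop_ai_benchmark_job(AIBenchmarkJobName="my-benchmark-job")
sm.stop_ai_recommendation_job(AIRecommendationJobName="my-recommendation-job")
```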

## Principles

1. **One thing at a time.** Each response advances exactly one decision.
2. **Confirm before proceeding.** Wait for the user to agree before moving to the next step.
3. **Don't read files until you need them.** Only read reference files when you've reached the step that requires them.
4. **Use what you know.** If the answer is in conversation history or any file you've already read, use it.
5. **No narration.** Share outcomes and ask questions. Keep responses short.
6. **Notebook writing.** Write notebooks using your standard file write tool to create the `.ipynb` file with the complete notebook JSON, OR use notebook MCP tools if available. Do NOT use bash commands to generate notebooks.

## Workflow

### Step 1: Determine the User's Goal

Check conversation history first. The user typically wants one of:

1. **Benchmark an existing endpoint** — They already have a deployed model and want performance metrics.
2. **Get deployment recommendations** — They have a model in S3 and want to know the best instance type and configuration.
3. **Both** — Benchmark first, then optimize.

If unclear, ask:

> "What would you like to do?
>
> 1. **Benchmark** — Measure performance of an existing SageMaker endpoint
> 2. **Get recommendations** — Find the best instance type and configuration for a model in S3
>
> Pick one, or describe what you're trying to achieve."

⏸ Wait for user.

- If benchmark → go to Step 2A.
- If recommendations → go to Step 2B.

### Step 2A: Benchmark an Existing Endpoint

Read `references/benchmark-workflow.md` and follow its instructions.

### Step 2B: Get Deployment Recommendations

Read `references/recommendation-workflow.md` and follow its instructions.

### Step 3: Review Results

After the job completes:

- For **benchmark jobs**: present the performance metrics (latency percentiles, throughput, cost estimates).
- For **recommendation jobs**: present the ranked recommendations with instance type, expected performance, and optimization details.

Read `references/interpreting-results.md` for guidance on presenting results to the user.

### Step 4: Next Steps

After presenting results, offer relevant next steps:

> "What would you like to do next?
>
> - **Deploy the recommended configuration** — I can help create a SageMaker endpoint using the top recommendation
> - **Run another benchmark** — Test with different parameters or a different workload
> - **Compare results** — Run recommendations with different performance targets (cost vs latency vs throughput)"

## Prerequisites

- **AWS credentials** configured via AWS CLI, environment variables, or SageMaker Space (see the sanity check below)
- **IAM role** with SageMaker permissions (`AmazonSageMakerFullAccess` or equivalent)
- For benchmarking: a deployed SageMaker endpoint
- For recommendations: a model stored in S3 (HuggingFace format)
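
A minimal sanity check before creating any jobs (uses only STS and the default credential chain):

```python
import boto3

# Verify that AWS credentials resolve before creating any jobs
ident = boto3.client("sts").get_caller_identity()
print(f"Account: {ident['Account']}")
print(f"Caller:  {ident['Arn']}")
```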

## Troubleshooting

### Common Issues

| Issue | Cause | Fix |
| --------------------------------------- | ------------------------------------------------------ | ------------------------------------------------------------------- |
| Job stuck in Pending | No available capacity for the requested instance type | Try a different instance type or wait for capacity |
| Job failed with "ResourceLimitExceeded" | Account quota exceeded | Request a quota increase for the instance type (see the quota sketch below) |
| Benchmark metrics look wrong | Workload config doesn't match the model's capabilities | Adjust token counts and concurrency in the workload config |
| Recommendation job failed | Model format not supported or S3 path incorrect | Verify the model is in HuggingFace format and the S3 URI is correct |
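
For the "ResourceLimitExceeded" case, a sketch for inspecting current SageMaker quotas via the Service Quotas API; the instance-type string is a placeholder and the name match is heuristic:

```python
import boto3

# List SageMaker quotas and filter by instance type (heuristic name match)
sq = boto3.client("service-quotas")
paginator = sq.get_paginator("list_service_quotas")
for page in paginator.paginate(ServiceCode="sagemaker"):
    for quota in page["Quotas"]:
        if "ml.g5.2xlarge" in quota["QuotaName"]:  # Placeholder instance type
            print(f"{quota['QuotaName']}: {quota['Value']}")
```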
**plugins/sagemaker-ai/skills/ai-optimization/references/benchmark-results.md**

# Benchmark Results Download

Generate a notebook cell that downloads and displays benchmark results. The output is stored as an `output.tar.gz` archive — the primary metrics file is `profile_export_aiperf.json`.

```python
import io
import json
import tarfile
from urllib.parse import urlparse

import boto3

# sm client is defined in a prior cell (Step 3)
result = sm.describe_ai_benchmark_job(AIBenchmarkJobName="my-benchmark-job")
s3_output = result["OutputConfig"]["S3OutputLocation"]

print(f"Job status: {result['AIBenchmarkJobStatus']}")
print(f"Results location: {s3_output}")

# Download the output.tar.gz archive from S3
s3 = boto3.client("s3")
parsed = urlparse(s3_output)
bucket = parsed.netloc
prefix = parsed.path.lstrip("/")

try:
    # Find the tar.gz file
    objects = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
    tar_key = None
    for obj in objects.get("Contents", []):
        if obj["Key"].endswith(".tar.gz"):
            tar_key = obj["Key"]
            break

    if not tar_key:
        raise FileNotFoundError(f"No tar.gz archive found at s3://{bucket}/{prefix}")

    # Download and extract the primary metrics file
    tar_bytes = s3.get_object(Bucket=bucket, Key=tar_key)["Body"].read()

    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:gz") as tar:
        print(f"Archive contents: {tar.getnames()}")

        metrics_data = None
        for member in tar.getmembers():
            if "profile_export_aiperf.json" in member.name:
                f = tar.extractfile(member)
                if f:
                    metrics_data = json.loads(f.read().decode("utf-8"))
                break

    if not metrics_data:
        raise FileNotFoundError("profile_export_aiperf.json not found in archive")

    # Display key metrics as a summary table
    summary_metrics = [
        "time_to_first_token", "inter_token_latency",
        "output_token_throughput", "request_throughput",
        "request_latency",
    ]
    rows = []
    for key in summary_metrics:
        metric = metrics_data.get(key)
        if isinstance(metric, dict) and "unit" in metric:
            rows.append({
                "Metric": key,
                "p50": metric.get("p50"),
                "p90": metric.get("p90"),
                "p99": metric.get("p99"),
                "avg": metric.get("avg"),
                "Unit": metric.get("unit"),
            })
    if rows:
        # Format as an aligned text table (no pandas dependency)
        header = f"{'Metric':<30} {'p50':>10} {'p90':>10} {'p99':>10} {'avg':>10} {'Unit':<15}"
        print(header)
        print("-" * len(header))
        for r in rows:
            print(f"{r['Metric']:<30} {r['p50'] or '':>10} {r['p90'] or '':>10} "
                  f"{r['p99'] or '':>10} {r['avg'] or '':>10} {r['Unit']:<15}")
    else:
        print("No recognized metrics found in profile_export_aiperf.json")

except FileNotFoundError as e:
    print(f"Results not available: {e}")
except Exception as e:
    print(f"Error downloading results: {e}")
    print("Check that the IAM role has s3:GetObject and s3:ListBucket permissions.")
```
**plugins/sagemaker-ai/skills/ai-optimization/references/benchmark-workflow.md**

# Benchmark Workflow

Guide the user through creating and running an AI Benchmark Job.

## Step 1: Gather Endpoint Information

You need:

- **Endpoint name** — The SageMaker endpoint to benchmark
- **Inference components** (optional) — If the endpoint uses inference components, which ones to target

If not already known, ask:

> "What's the name of the SageMaker endpoint you want to benchmark? If it uses inference components, let me know which ones to target."

Use the AWS MCP tool `describe-endpoint` to verify the endpoint exists and is InService. If the user specified inference components, also use `describe-inference-component` to verify they exist on the endpoint.
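
If the MCP tools aren't available, an equivalent check with boto3 (the endpoint and component names are placeholders):

```python
import boto3

sm = boto3.client("sagemaker")

# Verify the endpoint exists and is InService
ep = sm.describe_endpoint(EndpointName="<ENDPOINT_NAME>")
print(f"Endpoint status: {ep['EndpointStatus']}")  # Expect "InService"

# If the user specified inference components, verify each one
ic = sm.describe_inference_component(InferenceComponentName="<COMPONENT_NAME>")
print(f"Inference component status: {ic['InferenceComponentStatus']}")
```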

## Step 2: Create a Workload Config

A workload config defines the traffic pattern. Key parameters:

| Parameter | Description | Default |
| ---------------------------- | -------------------------- | ------- |
| `prompt_input_tokens_mean` | Average input token count | 512 |
| `prompt_input_tokens_stddev` | Std dev of input tokens | 50 |
| `output_tokens_mean` | Average output token count | 256 |
| `output_tokens_stddev` | Std dev of output tokens | 30 |
| `concurrency` | Concurrent requests | 1 |
| `request_count` | Total requests to send | 100 |

Ask the user:

> "What's the typical input/output length (in tokens) and concurrency? Or I can use sensible defaults."

⏸ Wait for user.

Generate a notebook cell that creates the workload config:

```python
import boto3
import json

sm = boto3.client("sagemaker")

workload_spec = {
    "benchmark": {"type": "aiperf"},
    "parameters": {
        "prompt_input_tokens_mean": 512,  # Adjust based on user input
        "prompt_input_tokens_stddev": 50,
        "output_tokens_mean": 256,  # Adjust based on user input
        "output_tokens_stddev": 30,
        "concurrency": 1,  # Adjust based on user input
        "request_count": 100,
    },
}

sm.create_ai_workload_config(
    AIWorkloadConfigName="my-workload-config",
    AIWorkloadConfigs={"WorkloadSpec": {"Inline": json.dumps(workload_spec)}},
)
```

## Step 3: Create the Benchmark Job

Generate a notebook cell that creates and monitors the benchmark job:

```python
import time

sm.create_ai_benchmark_job(
    AIBenchmarkJobName="my-benchmark-job",
    AIWorkloadConfigIdentifier="my-workload-config",
    RoleArn="<ROLE_ARN>",  # User's IAM role
    BenchmarkTarget={
        "Endpoint": {"Identifier": "<ENDPOINT_NAME>"}
    },
    OutputConfig={
        "S3OutputLocation": "s3://<BUCKET>/benchmark-results/"
    },
)

# Poll until complete (timeout after 1 hour)
MAX_WAIT = 3600
start = time.time()
while time.time() - start < MAX_WAIT:
    resp = sm.describe_ai_benchmark_job(AIBenchmarkJobName="my-benchmark-job")
    status = resp["AIBenchmarkJobStatus"]
    print(f"Status: {status} ({int(time.time() - start)}s elapsed)")
    if status in ("Completed", "Failed", "Stopped"):
        break
    time.sleep(30)
else:
    # The while-else runs only if the loop exits without a break (i.e. timeout)
    raise TimeoutError("Benchmark job did not complete within 1 hour")

if status == "Failed":
    print(f"Benchmark failed: {resp.get('FailureReason', 'Unknown')}")
elif status == "Stopped":
    print("Benchmark was stopped before completion.")
else:
    print("Benchmark completed successfully.")
```

## Step 4: Present Results

When the job completes, read `benchmark-results.md` for the code to download and display results.

Return to the main SKILL.md Step 3 (Review Results).