**plugins/sagemaker-ai/README.md**

This plugin brings deep AWS AI/ML expertise directly into your coding assistant, covering the surface area of [Amazon SageMaker AI](https://aws.amazon.com/sagemaker/ai/); currently, skills are provided to assist with the following capability areas:

- **Model Customization** — End-to-end guided workflows for fine-tuning foundation models, from use case definition through data preparation, training, evaluation, and deployment on Amazon SageMaker AI.
- **AI Optimization** — Benchmarking LLM inference performance and getting deployment recommendations for the best instance type, serving configuration, and optimizations.
- **HyperPod Cluster Operations** — Remote command execution on nodes via SSM, version checking, and diagnostic reporting for SageMaker HyperPod training clusters.

## Agent Skills
...
| 10 | `hyperpod-ssm` | Remote command execution and file transfer on HyperPod cluster nodes via SSM | [SKILL.md](skills/hyperpod-ssm/SKILL.md) |
| 11 | `hyperpod-version-checker` | Check and compare software component versions across HyperPod cluster nodes | [SKILL.md](skills/hyperpod-version-checker/SKILL.md) |
| 12 | `hyperpod-issue-report` | Generate diagnostic reports for HyperPod troubleshooting and support cases | [SKILL.md](skills/hyperpod-issue-report/SKILL.md) |
| 13 | `ai-optimization` | Guided workflows for benchmarking LLM inference and getting deployment recommendations (best instance, optimizations) | [SKILL.md](skills/ai-optimization/SKILL.md) |

## MCP Servers

...

The skills in this plugin encode AWS best practices, but they are fully customizable. You can fork the repository and modify any `SKILL.md` to reflect your organization's standards, approved techniques, required evaluation benchmarks, or internal tooling. Workspace-level skills take precedence over global skills, so teams can maintain their own versions without affecting other users.

## AI Optimization

The AI Optimization skill covers the SageMaker AI Optimization APIs for benchmarking and optimizing LLM inference performance. It guides users through:

- **Workload configuration** — Define traffic patterns (request shape, concurrency, dataset) for benchmarking
- **Benchmark jobs** — Measure inference performance (latency, throughput, cost) on existing SageMaker endpoints
- **Recommendation jobs** — Automatically find the best instance type, serving configuration, and optimizations (kernel tuning, speculative decoding) for a model

### How It Works

- **Define your workload** — Describe your expected traffic pattern (input/output token lengths, concurrency). The skill creates an AIWorkloadConfig.
- **Benchmark or recommend** — Either benchmark an existing endpoint, or provide a model S3 URI and let the service evaluate multiple instance types automatically.
- **Review results** — The skill presents ranked recommendations with expected performance metrics, optimization details, and deployable ModelPackages.
- **Deploy** — Each recommendation includes a ModelPackage that can be deployed directly to a SageMaker endpoint in a single step (see the sketch below).
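
For illustration, a minimal deployment sketch using the standard SageMaker `CreateModel`/`CreateEndpointConfig`/`CreateEndpoint` APIs; the ModelPackage ARN, resource names, and instance type below are placeholders, and the real ARN comes from the recommendation results:

```python
import boto3

sm = boto3.client("sagemaker")

# Placeholder ARN; substitute the ModelPackage ARN from the recommendation results
mp_arn = "arn:aws:sagemaker:us-east-1:123456789012:model-package/example-recommendation"

sm.create_model(
    ModelName="recommended-model",
    PrimaryContainer={"ModelPackageName": mp_arn},
    ExecutionRoleArn="<ROLE_ARN>",
)
sm.create_endpoint_config(
    EndpointConfigName="recommended-endpoint-config",
    ProductionVariants=[{
        "VariantName": "AllTraffic",
        "ModelName": "recommended-model",
        "InstanceType": "ml.g5.2xlarge",  # Placeholder; use the recommended instance type
        "InitialInstanceCount": 1,
    }],
)
sm.create_endpoint(
    EndpointName="recommended-endpoint",
    EndpointConfigName="recommended-endpoint-config",
)
```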

### Examples

- "Benchmark my SageMaker endpoint"
- "Find the best instance type for my Llama model"
- "Optimize inference cost for my model in S3"
- "Create a recommendation job for throughput optimization"
- "What's the cheapest way to serve my model?"

## Related Resources

- [Amazon SageMaker AI Model Customization](https://aws.amazon.com/sagemaker/ai/model-customization/)
**plugins/sagemaker-ai/skills/ai-optimization/SKILL.md**

---
name: ai-optimization
description: Guides users through SageMaker AI Optimization APIs for benchmarking and optimizing LLM inference. Covers workload configuration, benchmark jobs, and recommendation jobs that find the best instance type, optimization strategy, and serving configuration for a model. Use when the user says "benchmark my model", "optimize inference", "find the best instance", "recommendation job", "workload config", "AI benchmark", "AI recommendation", "reduce inference cost", "improve latency", or "optimize throughput".
metadata:
  version: "1.0.0"
---

# AI Optimization

Guide users through SageMaker AI Optimization APIs to benchmark LLM inference performance and get deployment recommendations.

## Scope

This skill covers the **SageMaker AI Optimization** APIs, which help users:

- **Benchmark** an existing SageMaker endpoint to measure inference performance (latency, throughput, cost)
- **Get recommendations** for the best instance type, serving configuration, and optional optimizations (kernel tuning, speculative decoding) for deploying a model

### Three Resource Types

| Resource | Purpose |
| ----------------------- | -------------------------------------------------------------------------------------------------------- |
| **AIWorkloadConfig** | Defines the traffic pattern (request shape, concurrency, dataset) for benchmarking |
| **AIBenchmarkJob** | Runs a benchmark against a live SageMaker endpoint using a workload config |
| **AIRecommendationJob** | Analyzes a model, deploys it on candidate instances, benchmarks each, and returns ranked recommendations |

### 14 API Operations

| Resource | Create | Describe | Delete | List | Stop |
| ------------------- | ------ | -------- | ------ | ---- | ---- |
| AIWorkloadConfig | ✓ | ✓ | ✓ | ✓ | |
| AIBenchmarkJob | ✓ | ✓ | ✓ | ✓ | ✓ |
| AIRecommendationJob | ✓ | ✓ | ✓ | ✓ | ✓ |
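
As a rough sketch of how these operations might surface in boto3, the method names below are assumptions extrapolated from the create/describe calls in this skill's reference files, following boto3's snake_case convention:

```python
import boto3

sm = boto3.client("sagemaker")

# Assumed method names, mirroring the operations table above
sm.list_ai_workload_configs()
sm.describe_ai_workload_config(AIWorkloadConfigName="my-workload-config")
sm.delete_ai_workload_config(AIWorkloadConfigName="my-workload-config")
sm.stop_ai_benchmark_job(AIBenchmarkJobName="my-benchmark-job")
sm.stop_ai_recommendation_job(AIRecommendationJobName="my-recommendation-job")
```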

## Principles

1. **One thing at a time.** Each response advances exactly one decision.
2. **Confirm before proceeding.** Wait for the user to agree before moving to the next step.
3. **Don't read files until you need them.** Only read reference files when you've reached the step that requires them.
4. **Use what you know.** If the answer is in conversation history or any file you've already read, use it.
5. **No narration.** Share outcomes and ask questions. Keep responses short.
6. **Notebook writing.** Write notebooks using your standard file write tool to create the `.ipynb` file with the complete notebook JSON, OR use notebook MCP tools if available. Do NOT use bash commands to generate notebooks.

## Workflow

### Step 1: Determine the User's Goal

Check conversation history first. The user typically wants one of:

1. **Benchmark an existing endpoint** — They already have a deployed model and want performance metrics.
2. **Get deployment recommendations** — They have a model in S3 and want to know the best instance type and configuration.
3. **Both** — Benchmark first, then optimize.

If unclear, ask:

> "What would you like to do?
>
> 1. **Benchmark** — Measure performance of an existing SageMaker endpoint
> 2. **Get recommendations** — Find the best instance type and configuration for a model in S3
>
> Pick one, or describe what you're trying to achieve."

⏸ Wait for user.

- If benchmark → go to Step 2A.
- If recommendations → go to Step 2B.

### Step 2A: Benchmark an Existing Endpoint

Read `references/benchmark-workflow.md` and follow its instructions.

### Step 2B: Get Deployment Recommendations

Read `references/recommendation-workflow.md` and follow its instructions.

### Step 3: Review Results

After the job completes:

- For **benchmark jobs**: present the performance metrics (latency percentiles, throughput, cost estimates).
- For **recommendation jobs**: present the ranked recommendations with instance type, expected performance, and optimization details.

Read `references/interpreting-results.md` for guidance on presenting results to the user.

### Step 4: Next Steps

After presenting results, offer relevant next steps:

> "What would you like to do next?
>
> - **Deploy the recommended configuration** — I can help create a SageMaker endpoint using the top recommendation
> - **Run another benchmark** — Test with different parameters or a different workload
> - **Compare results** — Run recommendations with different performance targets (cost vs latency vs throughput)"

## Prerequisites

- **AWS credentials** configured via AWS CLI, environment variables, or SageMaker Space (see the sanity check below)
- **IAM role** with SageMaker permissions (`AmazonSageMakerFullAccess` or equivalent)
- For benchmarking: a deployed SageMaker endpoint
- For recommendations: a model stored in S3 (HuggingFace format)
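
A minimal sanity check before creating any jobs (uses only STS and the default credential chain):

```python
import boto3

# Verify that AWS credentials resolve before creating any jobs
ident = boto3.client("sts").get_caller_identity()
print(f"Account: {ident['Account']}")
print(f"Caller:  {ident['Arn']}")
```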

## Troubleshooting

### Common Issues

| Issue | Cause | Fix |
| --------------------------------------- | ------------------------------------------------------ | ------------------------------------------------------------------- |
| Job stuck in Pending | No available capacity for the requested instance type | Try a different instance type or wait for capacity |
| Job failed with "ResourceLimitExceeded" | Account quota exceeded | Request a quota increase for the instance type (see the quota sketch below) |
| Benchmark metrics look wrong | Workload config doesn't match the model's capabilities | Adjust token counts and concurrency in the workload config |
| Recommendation job failed | Model format not supported or S3 path incorrect | Verify the model is in HuggingFace format and the S3 URI is correct |
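
For the "ResourceLimitExceeded" case, a sketch for inspecting current SageMaker quotas via the Service Quotas API; the instance-type string is a placeholder and the name match is heuristic:

```python
import boto3

# List SageMaker quotas and filter by instance type (heuristic name match)
sq = boto3.client("service-quotas")
paginator = sq.get_paginator("list_service_quotas")
for page in paginator.paginate(ServiceCode="sagemaker"):
    for quota in page["Quotas"]:
        if "ml.g5.2xlarge" in quota["QuotaName"]:  # Placeholder instance type
            print(f"{quota['QuotaName']}: {quota['Value']}")
```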
**plugins/sagemaker-ai/skills/ai-optimization/references/benchmark-results.md**

# Benchmark Results Download

Generate a notebook cell that downloads and displays benchmark results. The output is stored as an `output.tar.gz` archive — the primary metrics file is `profile_export_aiperf.json`.

```python
import io
import json
import tarfile
from urllib.parse import urlparse

import boto3

# sm client is defined in a prior cell (Step 3)
result = sm.describe_ai_benchmark_job(AIBenchmarkJobName="my-benchmark-job")
s3_output = result["OutputConfig"]["S3OutputLocation"]

print(f"Job status: {result['AIBenchmarkJobStatus']}")
print(f"Results location: {s3_output}")

# Download the output.tar.gz archive from S3
s3 = boto3.client("s3")
parsed = urlparse(s3_output)
bucket = parsed.netloc
prefix = parsed.path.lstrip("/")

try:
    # Find the tar.gz file
    objects = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
    tar_key = None
    for obj in objects.get("Contents", []):
        if obj["Key"].endswith(".tar.gz"):
            tar_key = obj["Key"]
            break

    if not tar_key:
        raise FileNotFoundError(f"No tar.gz archive found at s3://{bucket}/{prefix}")

    # Download and extract the primary metrics file
    tar_bytes = s3.get_object(Bucket=bucket, Key=tar_key)["Body"].read()

    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:gz") as tar:
        print(f"Archive contents: {tar.getnames()}")

        metrics_data = None
        for member in tar.getmembers():
            if "profile_export_aiperf.json" in member.name:
                f = tar.extractfile(member)
                if f:
                    metrics_data = json.loads(f.read().decode("utf-8"))
                break

    if not metrics_data:
        raise FileNotFoundError("profile_export_aiperf.json not found in archive")

    # Display key metrics as a summary table
    summary_metrics = [
        "time_to_first_token", "inter_token_latency",
        "output_token_throughput", "request_throughput",
        "request_latency",
    ]
    rows = []
    for key in summary_metrics:
        metric = metrics_data.get(key)
        if isinstance(metric, dict) and "unit" in metric:
            rows.append({
                "Metric": key,
                "p50": metric.get("p50"),
                "p90": metric.get("p90"),
                "p99": metric.get("p99"),
                "avg": metric.get("avg"),
                "Unit": metric.get("unit"),
            })
    if rows:
        # Format as an aligned text table (no pandas dependency)
        header = f"{'Metric':<30} {'p50':>10} {'p90':>10} {'p99':>10} {'avg':>10} {'Unit':<15}"
        print(header)
        print("-" * len(header))
        for r in rows:
            print(f"{r['Metric']:<30} {r['p50'] or '':>10} {r['p90'] or '':>10} "
                  f"{r['p99'] or '':>10} {r['avg'] or '':>10} {r['Unit']:<15}")
    else:
        print("No recognized metrics found in profile_export_aiperf.json")

except FileNotFoundError as e:
    print(f"Results not available: {e}")
except Exception as e:
    print(f"Error downloading results: {e}")
    print("Check that the IAM role has s3:GetObject and s3:ListBucket permissions.")
```
**plugins/sagemaker-ai/skills/ai-optimization/references/benchmark-workflow.md**

# Benchmark Workflow

Guide the user through creating and running an AI Benchmark Job.

## Step 1: Gather Endpoint Information

You need:

- **Endpoint name** — The SageMaker endpoint to benchmark
- **Inference components** (optional) — If the endpoint uses inference components, which ones to target

If not already known, ask:

> "What's the name of the SageMaker endpoint you want to benchmark? If it uses inference components, let me know which ones to target."

Use the AWS MCP tool `describe-endpoint` to verify the endpoint exists and is InService. If the user specified inference components, also use `describe-inference-component` to verify they exist on the endpoint.
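
If the MCP tools aren't available, an equivalent check with boto3 (the endpoint and component names are placeholders):

```python
import boto3

sm = boto3.client("sagemaker")

# Verify the endpoint exists and is InService
ep = sm.describe_endpoint(EndpointName="<ENDPOINT_NAME>")
print(f"Endpoint status: {ep['EndpointStatus']}")  # Expect "InService"

# If the user specified inference components, verify each one
ic = sm.describe_inference_component(InferenceComponentName="<COMPONENT_NAME>")
print(f"Inference component status: {ic['InferenceComponentStatus']}")
```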

## Step 2: Create a Workload Config

A workload config defines the traffic pattern. Key parameters:

| Parameter | Description | Default |
| ---------------------------- | -------------------------- | ------- |
| `prompt_input_tokens_mean` | Average input token count | 512 |
| `prompt_input_tokens_stddev` | Std dev of input tokens | 50 |
| `output_tokens_mean` | Average output token count | 256 |
| `output_tokens_stddev` | Std dev of output tokens | 30 |
| `concurrency` | Concurrent requests | 1 |
| `request_count` | Total requests to send | 100 |

Ask the user:

> "What's the typical input/output length (in tokens) and concurrency? Or I can use sensible defaults."

⏸ Wait for user.

Generate a notebook cell that creates the workload config:

```python
import boto3
import json

sm = boto3.client("sagemaker")

workload_spec = {
    "benchmark": {"type": "aiperf"},
    "parameters": {
        "prompt_input_tokens_mean": 512,  # Adjust based on user input
        "prompt_input_tokens_stddev": 50,
        "output_tokens_mean": 256,  # Adjust based on user input
        "output_tokens_stddev": 30,
        "concurrency": 1,  # Adjust based on user input
        "request_count": 100,
    },
}

sm.create_ai_workload_config(
    AIWorkloadConfigName="my-workload-config",
    AIWorkloadConfigs={"WorkloadSpec": {"Inline": json.dumps(workload_spec)}},
)
```

## Step 3: Create the Benchmark Job

Generate a notebook cell that creates and monitors the benchmark job:

```python
import time

sm.create_ai_benchmark_job(
    AIBenchmarkJobName="my-benchmark-job",
    AIWorkloadConfigIdentifier="my-workload-config",
    RoleArn="<ROLE_ARN>",  # User's IAM role
    BenchmarkTarget={
        "Endpoint": {"Identifier": "<ENDPOINT_NAME>"}
    },
    OutputConfig={
        "S3OutputLocation": "s3://<BUCKET>/benchmark-results/"
    },
)

# Poll until complete (timeout after 1 hour)
MAX_WAIT = 3600
start = time.time()
while time.time() - start < MAX_WAIT:
    resp = sm.describe_ai_benchmark_job(AIBenchmarkJobName="my-benchmark-job")
    status = resp["AIBenchmarkJobStatus"]
    print(f"Status: {status} ({int(time.time() - start)}s elapsed)")
    if status in ("Completed", "Failed", "Stopped"):
        break
    time.sleep(30)
else:
    # The while-else runs only if the loop exits without a break (i.e. timeout)
    raise TimeoutError("Benchmark job did not complete within 1 hour")

if status == "Failed":
    print(f"Benchmark failed: {resp.get('FailureReason', 'Unknown')}")
elif status == "Stopped":
    print("Benchmark was stopped before completion.")
else:
    print("Benchmark completed successfully.")
```

## Step 4: Present Results

When the job completes, read `benchmark-results.md` for the code to download and display results.

Return to the main SKILL.md Step 3 (Review Results).