From 38bdf3e21a00b1adb7566dfca289761aa5dbccd4 Mon Sep 17 00:00:00 2001 From: Loki Ravi Date: Fri, 24 Apr 2026 17:03:30 +0000 Subject: [PATCH] Add ai-optimization skill for SageMaker AI Optimization APIs New skill covering the 14 SageMaker AI Optimization API operations (AIWorkloadConfig, AIBenchmarkJob, AIRecommendationJob) with guided workflows for benchmarking LLM inference and getting deployment recommendations. Skill structure: - SKILL.md (110 lines): Main skill with intent-matching description - references/benchmark-workflow.md (96 lines): Benchmark job guide - references/benchmark-results.md (78 lines): Results download code - references/recommendation-workflow.md (96 lines): Recommendation guide - references/recommendation-options.md (74 lines): Config options + dataset - references/recommendation-deploy.md (41 lines): ModelPackage deployment - references/interpreting-results.md (78 lines): Metrics presentation All files conform to DESIGN_GUIDELINES.md limits (SKILL.md <300, references <100 lines each). Code samples verified against the public Smithy model. --- plugins/sagemaker-ai/README.md | 25 ++++ .../skills/ai-optimization/SKILL.md | 111 ++++++++++++++++++ .../references/benchmark-results.md | 89 ++++++++++++++ .../references/benchmark-workflow.md | 107 +++++++++++++++++ .../references/interpreting-results.md | 78 ++++++++++++ .../references/recommendation-deploy.md | 41 +++++++ .../references/recommendation-options.md | 74 ++++++++++++ .../references/recommendation-workflow.md | 100 ++++++++++++++++ 8 files changed, 625 insertions(+) create mode 100644 plugins/sagemaker-ai/skills/ai-optimization/SKILL.md create mode 100644 plugins/sagemaker-ai/skills/ai-optimization/references/benchmark-results.md create mode 100644 plugins/sagemaker-ai/skills/ai-optimization/references/benchmark-workflow.md create mode 100644 plugins/sagemaker-ai/skills/ai-optimization/references/interpreting-results.md create mode 100644 plugins/sagemaker-ai/skills/ai-optimization/references/recommendation-deploy.md create mode 100644 plugins/sagemaker-ai/skills/ai-optimization/references/recommendation-options.md create mode 100644 plugins/sagemaker-ai/skills/ai-optimization/references/recommendation-workflow.md diff --git a/plugins/sagemaker-ai/README.md b/plugins/sagemaker-ai/README.md index 764821f..4645812 100644 --- a/plugins/sagemaker-ai/README.md +++ b/plugins/sagemaker-ai/README.md @@ -3,6 +3,7 @@ This plugin brings deep AWS AI/ML expertise directly into your coding assistant, covering the surface area of [Amazon SageMaker AI](https://aws.amazon.com/sagemaker/ai/); currently, skills are provided to assist with the following capability areas: - **Model Customization** — End-to-end guided workflows for fine-tuning foundation models, from use case definition through data preparation, training, evaluation, and deployment on Amazon SageMaker AI. +- **AI Optimization** — Benchmarking LLM inference performance and getting deployment recommendations for the best instance type, serving configuration, and optimizations. - **HyperPod Cluster Operations** — Remote command execution on nodes via SSM, version checking, and diagnostic reporting for SageMaker HyperPod training clusters. 
## Agent Skills @@ -21,6 +22,7 @@ This plugin brings deep AWS AI/ML expertise directly into your coding assistant, | 10 | `hyperpod-ssm` | Remote command execution and file transfer on HyperPod cluster nodes via SSM | [SKILL.md](skills/hyperpod-ssm/SKILL.md) | | 11 | `hyperpod-version-checker` | Check and compare software component versions across HyperPod cluster nodes | [SKILL.md](skills/hyperpod-version-checker/SKILL.md) | | 12 | `hyperpod-issue-report` | Generate diagnostic reports for HyperPod troubleshooting and support cases | [SKILL.md](skills/hyperpod-issue-report/SKILL.md) | +| 13 | `ai-optimization` | Guided workflows for benchmarking LLM inference and getting deployment recommendations (best instance, optimizations) | [SKILL.md](skills/ai-optimization/SKILL.md) | ## MCP Servers @@ -181,6 +183,29 @@ Learn more about AWS Identity and Access Management for Amazon SageMaker AI [her The skills in this plugin encode AWS best practices, but they are fully customizable. You can fork the repository and modify any `SKILL.md` to reflect your organization's standards, approved techniques, required evaluation benchmarks, or internal tooling. Workspace-level skills take precedence over global skills, so teams can maintain their own versions without affecting other users. +## AI Optimization + +The AI Optimization skill covers the SageMaker AI Optimization APIs for benchmarking and optimizing LLM inference performance. It guides users through: + +- **Workload configuration** — Define traffic patterns (request shape, concurrency, dataset) for benchmarking +- **Benchmark jobs** — Measure inference performance (latency, throughput, cost) on existing SageMaker endpoints +- **Recommendation jobs** — Automatically find the best instance type, serving configuration, and optimizations (kernel tuning, speculative decoding) for a model + +### How It Works + +- **Define your workload** — Describe your expected traffic pattern (input/output token lengths, concurrency). The skill creates an AIWorkloadConfig. +- **Benchmark or recommend** — Either benchmark an existing endpoint, or provide a model S3 URI and let the service evaluate multiple instance types automatically. +- **Review results** — The skill presents ranked recommendations with expected performance metrics, optimization details, and deployable ModelPackages. +- **Deploy** — Each recommendation includes a ModelPackage that can be deployed directly to a SageMaker endpoint with one click. + +### Examples + +- "Benchmark my SageMaker endpoint" +- "Find the best instance type for my Llama model" +- "Optimize inference cost for my model in S3" +- "Create a recommendation job for throughput optimization" +- "What's the cheapest way to serve my model?" + ## Related Resources - [Amazon SageMaker AI Model Customization](https://aws.amazon.com/sagemaker/ai/model-customization/) diff --git a/plugins/sagemaker-ai/skills/ai-optimization/SKILL.md b/plugins/sagemaker-ai/skills/ai-optimization/SKILL.md new file mode 100644 index 0000000..693c8df --- /dev/null +++ b/plugins/sagemaker-ai/skills/ai-optimization/SKILL.md @@ -0,0 +1,111 @@ +--- +name: ai-optimization +description: Guides users through SageMaker AI Optimization APIs for benchmarking and optimizing LLM inference. Covers workload configuration, benchmark jobs, and recommendation jobs that find the best instance type, optimization strategy, and serving configuration for a model. 
Use when the user says "benchmark my model", "optimize inference", "find the best instance", "recommendation job", "workload config", "AI benchmark", "AI recommendation", "reduce inference cost", "improve latency", or "optimize throughput". +metadata: + version: "1.0.0" +--- + +# AI Optimization + +Guide users through SageMaker AI Optimization APIs to benchmark LLM inference performance and get deployment recommendations. + +## Scope + +This skill covers the **SageMaker AI Optimization** APIs, which help users: + +- **Benchmark** an existing SageMaker endpoint to measure inference performance (latency, throughput, cost) +- **Get recommendations** for the best instance type, serving configuration, and optional optimizations (kernel tuning, speculative decoding) for deploying a model + +### Three Resource Types + +| Resource | Purpose | +| ----------------------- | -------------------------------------------------------------------------------------------------------- | +| **AIWorkloadConfig** | Defines the traffic pattern (request shape, concurrency, dataset) for benchmarking | +| **AIBenchmarkJob** | Runs a benchmark against a live SageMaker endpoint using a workload config | +| **AIRecommendationJob** | Analyzes a model, deploys it on candidate instances, benchmarks each, and returns ranked recommendations | + +### 14 API Operations + +| Resource | Create | Describe | Delete | List | Stop | +| ------------------- | ------ | -------- | ------ | ---- | ---- | +| AIWorkloadConfig | ✓ | ✓ | ✓ | ✓ | | +| AIBenchmarkJob | ✓ | ✓ | ✓ | ✓ | ✓ | +| AIRecommendationJob | ✓ | ✓ | ✓ | ✓ | ✓ | + +## Principles + +1. **One thing at a time.** Each response advances exactly one decision. +2. **Confirm before proceeding.** Wait for the user to agree before moving to the next step. +3. **Don't read files until you need them.** Only read reference files when you've reached the step that requires them. +4. **Use what you know.** If the answer is in conversation history or any file you've already read, use it. +5. **No narration.** Share outcomes and ask questions. Keep responses short. +6. **Notebook writing.** Write notebooks using your standard file write tool to create the `.ipynb` file with the complete notebook JSON, OR use notebook MCP tools if available. Do NOT use bash commands to generate notebooks. + +## Workflow + +### Step 1: Determine the User's Goal + +Check conversation history first. The user typically wants one of: + +1. **Benchmark an existing endpoint** — They already have a deployed model and want performance metrics. +2. **Get deployment recommendations** — They have a model in S3 and want to know the best instance type and configuration. +3. **Both** — Benchmark first, then optimize. + +If unclear, ask: + +> "What would you like to do? +> +> 1. **Benchmark** — Measure performance of an existing SageMaker endpoint +> 2. **Get recommendations** — Find the best instance type and configuration for a model in S3 +> +> Pick one, or describe what you're trying to achieve." + +⏸ Wait for user. + +- If benchmark → go to Step 2A. +- If recommendations → go to Step 2B. + +### Step 2A: Benchmark an Existing Endpoint + +Read `references/benchmark-workflow.md` and follow its instructions. + +### Step 2B: Get Deployment Recommendations + +Read `references/recommendation-workflow.md` and follow its instructions. + +### Step 3: Review Results + +After the job completes: + +- For **benchmark jobs**: present the performance metrics (latency percentiles, throughput, cost estimates). 
+- For **recommendation jobs**: present the ranked recommendations with instance type, expected performance, and optimization details. + +Read `references/interpreting-results.md` for guidance on presenting results to the user. + +### Step 4: Next Steps + +After presenting results, offer relevant next steps: + +> "What would you like to do next? +> +> - **Deploy the recommended configuration** — I can help create a SageMaker endpoint using the top recommendation +> - **Run another benchmark** — Test with different parameters or a different workload +> - **Compare results** — Run recommendations with different performance targets (cost vs latency vs throughput)" + +## Prerequisites + +- **AWS credentials** configured (via AWS CLI, environment variables, or SageMaker Space) +- **IAM role** with SageMaker permissions (`AmazonSageMakerFullAccess` or equivalent) +- For benchmarking: a deployed SageMaker endpoint +- For recommendations: a model stored in S3 (HuggingFace format) + +## Troubleshooting + +### Common Issues + +| Issue | Cause | Fix | +| --------------------------------------- | ------------------------------------------------------ | ------------------------------------------------------------------- | +| Job stuck in Pending | No available capacity for the requested instance type | Try a different instance type or wait for capacity | +| Job failed with "ResourceLimitExceeded" | Account quota exceeded | Request a quota increase for the instance type | +| Benchmark metrics look wrong | Workload config doesn't match the model's capabilities | Adjust token counts and concurrency in the workload config | +| Recommendation job failed | Model format not supported or S3 path incorrect | Verify the model is in HuggingFace format and the S3 URI is correct | diff --git a/plugins/sagemaker-ai/skills/ai-optimization/references/benchmark-results.md b/plugins/sagemaker-ai/skills/ai-optimization/references/benchmark-results.md new file mode 100644 index 0000000..283f5ba --- /dev/null +++ b/plugins/sagemaker-ai/skills/ai-optimization/references/benchmark-results.md @@ -0,0 +1,89 @@ +# Benchmark Results Download + +Generate a notebook cell that downloads and displays benchmark results. The output is stored as an `output.tar.gz` archive — the primary metrics file is `profile_export_aiperf.json`. 
+ +```python +import io +import json +import tarfile +from urllib.parse import urlparse + +import boto3 + +# sm client is defined in a prior cell (Step 3) +result = sm.describe_ai_benchmark_job(AIBenchmarkJobName="my-benchmark-job") +s3_output = result["OutputConfig"]["S3OutputLocation"] + +print(f"Job status: {result['AIBenchmarkJobStatus']}") +print(f"Results location: {s3_output}") + +# Download the output.tar.gz archive from S3 +s3 = boto3.client("s3") +parsed = urlparse(s3_output) +bucket = parsed.netloc +prefix = parsed.path.lstrip("/") + +try: + # Find the tar.gz file + objects = s3.list_objects_v2(Bucket=bucket, Prefix=prefix) + tar_key = None + for obj in objects.get("Contents", []): + if obj["Key"].endswith(".tar.gz"): + tar_key = obj["Key"] + break + + if not tar_key: + raise FileNotFoundError(f"No tar.gz archive found at s3://{bucket}/{prefix}") + + # Download and extract the primary metrics file + tar_bytes = s3.get_object(Bucket=bucket, Key=tar_key)["Body"].read() + + with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:gz") as tar: + print(f"Archive contents: {tar.getnames()}") + + metrics_data = None + for member in tar.getmembers(): + if "profile_export_aiperf.json" in member.name: + f = tar.extractfile(member) + if f: + metrics_data = json.loads(f.read().decode("utf-8")) + break + + if not metrics_data: + raise FileNotFoundError("profile_export_aiperf.json not found in archive") + + # Display key metrics as a summary table + summary_metrics = [ + "time_to_first_token", "inter_token_latency", + "output_token_throughput", "request_throughput", + "request_latency", + ] + rows = [] + for key in summary_metrics: + metric = metrics_data.get(key) + if isinstance(metric, dict) and "unit" in metric: + rows.append({ + "Metric": key, + "p50": metric.get("p50"), + "p90": metric.get("p90"), + "p99": metric.get("p99"), + "avg": metric.get("avg"), + "Unit": metric.get("unit"), + }) + if rows: + # Format as aligned text table (no pandas dependency) + header = f"{'Metric':<30} {'p50':>10} {'p90':>10} {'p99':>10} {'avg':>10} {'Unit':<15}" + print(header) + print("-" * len(header)) + for r in rows: + print(f"{r['Metric']:<30} {r['p50'] or '':>10} {r['p90'] or '':>10} " + f"{r['p99'] or '':>10} {r['avg'] or '':>10} {r['Unit']:<15}") + else: + print("No recognized metrics found in profile_export_aiperf.json") + +except FileNotFoundError as e: + print(f"Results not available: {e}") +except Exception as e: + print(f"Error downloading results: {e}") + print("Check that the IAM role has s3:GetObject and s3:ListBucket permissions.") +``` diff --git a/plugins/sagemaker-ai/skills/ai-optimization/references/benchmark-workflow.md b/plugins/sagemaker-ai/skills/ai-optimization/references/benchmark-workflow.md new file mode 100644 index 0000000..f58662f --- /dev/null +++ b/plugins/sagemaker-ai/skills/ai-optimization/references/benchmark-workflow.md @@ -0,0 +1,107 @@ +# Benchmark Workflow + +Guide the user through creating and running an AI Benchmark Job. + +## Step 1: Gather Endpoint Information + +You need: + +- **Endpoint name** — The SageMaker endpoint to benchmark +- **Inference components** (optional) — If the endpoint uses inference components, which ones to target + +If not already known, ask: + +> "What's the name of the SageMaker endpoint you want to benchmark? If it uses inference components, let me know which ones to target." + +Use the AWS MCP tool `describe-endpoint` to verify the endpoint exists and is InService. 
If the user specified inference components, also use `describe-inference-component` to verify they exist on the endpoint.

## Step 2: Create a Workload Config

A workload config defines the traffic pattern. Key parameters:

| Parameter                    | Description                | Default |
| ---------------------------- | -------------------------- | ------- |
| `prompt_input_tokens_mean`   | Average input token count  | 512     |
| `prompt_input_tokens_stddev` | Std dev of input tokens    | 50      |
| `output_tokens_mean`         | Average output token count | 256     |
| `output_tokens_stddev`       | Std dev of output tokens   | 30      |
| `concurrency`                | Concurrent requests        | 1       |
| `request_count`              | Total requests to send     | 100     |

Ask the user:

> "What's the typical input/output length (in tokens) and concurrency? Or I can use sensible defaults."

⏸ Wait for user.

Generate a notebook cell that creates the workload config:

```python
import boto3
import json

sm = boto3.client("sagemaker")

workload_spec = {
    "benchmark": {"type": "aiperf"},
    "parameters": {
        "prompt_input_tokens_mean": 512,  # Adjust based on user input
        "prompt_input_tokens_stddev": 50,
        "output_tokens_mean": 256,  # Adjust based on user input
        "output_tokens_stddev": 30,
        "concurrency": 1,  # Adjust based on user input
        "request_count": 100,
    },
}

sm.create_ai_workload_config(
    AIWorkloadConfigName="my-workload-config",
    AIWorkloadConfigs={"WorkloadSpec": {"Inline": json.dumps(workload_spec)}},
)
```

## Step 3: Create the Benchmark Job

Generate a notebook cell that creates and monitors the benchmark job:

```python
import time

sm.create_ai_benchmark_job(
    AIBenchmarkJobName="my-benchmark-job",
    AIWorkloadConfigIdentifier="my-workload-config",
    RoleArn="<role-arn>",  # User's IAM role
    BenchmarkTarget={
        "Endpoint": {"Identifier": "<endpoint-name>"}  # Endpoint verified in Step 1
    },
    OutputConfig={
        "S3OutputLocation": "s3://<bucket>/benchmark-results/"
    },
)

# Poll until complete (timeout after 1 hour)
MAX_WAIT = 3600
start = time.time()
while time.time() - start < MAX_WAIT:
    resp = sm.describe_ai_benchmark_job(AIBenchmarkJobName="my-benchmark-job")
    status = resp["AIBenchmarkJobStatus"]
    print(f"Status: {status} ({int(time.time() - start)}s elapsed)")
    if status in ("Completed", "Failed", "Stopped"):
        break
    time.sleep(30)
else:
    raise TimeoutError("Benchmark job did not complete within 1 hour")

if status == "Failed":
    print(f"Benchmark failed: {resp.get('FailureReason', 'Unknown')}")
elif status == "Stopped":
    print("Benchmark was stopped before completion.")
else:
    print("Benchmark completed successfully.")
```

## Step 4: Present Results

When the job completes, read `benchmark-results.md` for the code to download and display results.

Return to the main SKILL.md Step 3 (Review Results).
diff --git a/plugins/sagemaker-ai/skills/ai-optimization/references/interpreting-results.md b/plugins/sagemaker-ai/skills/ai-optimization/references/interpreting-results.md
new file mode 100644
index 0000000..47ca12a
--- /dev/null
+++ b/plugins/sagemaker-ai/skills/ai-optimization/references/interpreting-results.md
@@ -0,0 +1,78 @@
# Interpreting Results

Guide for presenting benchmark and recommendation results to users.
+ +## Benchmark Job Results + +Present metrics in a clear table format: + +| Metric | Stat | Value | Unit | +| --------------------- | ---- | ----- | ---------- | +| TimeToFirstToken | p50 | 120 | ms | +| TimeToFirstToken | p90 | 180 | ms | +| TimeToFirstToken | p99 | 250 | ms | +| InterTokenLatency | p50 | 15 | ms | +| OutputTokenThroughput | avg | 45.2 | tokens/s | +| RequestThroughput | avg | 2.1 | requests/s | +| RequestLatency | p50 | 3200 | ms | + +Key insights to highlight: + +- **TTFT p50 vs p99** — Large gaps indicate inconsistent performance (possibly due to batching or cold starts) +- **OutputTokenThroughput** — Higher is better for batch workloads +- **Concurrency impact** — If the user ran multiple benchmarks at different concurrency levels, compare how metrics scale + +## Recommendation Job Results + +Present recommendations as a ranked table: + +> "Here are the recommendations, ranked by [cost/latency/throughput]: +> +> | # | Instance Type | TTFT p50 | Throughput | Optimizations | Est. Cost | +> | - | -------------- | -------- | ---------- | ------------- | --------- | +> | 1 | ml.g6.xlarge | 95ms | 42 tok/s | Kernel Tuning | $1.20/hr | +> | 2 | ml.g5.xlarge | 110ms | 38 tok/s | None | $1.00/hr | +> | 3 | ml.g6.12xlarge | 45ms | 120 tok/s | Kernel Tuning | $7.20/hr | + +### What Each Field Means + +- **Instance Type** — The GPU instance the model was tested on +- **TTFT (Time to First Token)** — How long until the first token is generated. Lower is better for interactive use cases. +- **Throughput (Output Token Throughput)** — Tokens generated per second. Higher is better for batch processing. +- **Optimizations** — What the service applied: + - **Kernel Tuning** — GPU kernel optimizations specific to the hardware. Improves throughput with no quality impact. + - **Speculative Decoding** — Uses a smaller draft model to predict tokens ahead. Improves latency but requires a compatible draft model. +- **CopyCountPerInstance** — How many model copies fit on one instance. More copies = higher throughput per instance. + +### Helping the User Choose + +Guide based on their performance target: + +- **Cost target** → Recommend #1 (lowest cost that meets baseline performance) +- **Latency target** → Recommend the one with lowest TTFT at the user's target percentile +- **Throughput target** → Recommend the one with highest OutputTokenThroughput + +If the user is unsure, suggest: + +> "For interactive applications (chatbots, real-time), prioritize low TTFT. +> For batch processing (summarization, translation), prioritize high throughput. +> For cost-sensitive workloads, the cost-optimized recommendation balances both." + +### ModelPackage Deployment + +Every recommendation includes a `ModelPackageArn` and `InferenceSpecificationName`. Explain: + +> "Each recommendation is packaged as a deployable ModelPackage. You can deploy any of them directly — the model weights, container image, environment variables, and optimization artifacts are all bundled together. +> +> Would you like me to generate deployment code for one of these recommendations?" 
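If it helps to show the user where those identifiers live, here is a minimal sketch of listing them (assuming `resp` holds the `DescribeAIRecommendationJob` response from the recommendation workflow; adjust names to whatever the user's notebook actually uses):

```python
# Minimal sketch: surface the deployable ModelPackage behind each recommendation.
# Assumes `resp` is the DescribeAIRecommendationJob response built in recommendation-workflow.md.
for rank, rec in enumerate(resp.get("Recommendations", []), start=1):
    details = rec.get("ModelDetails", {})
    print(f"#{rank}: {details.get('ModelPackageArn')} "
          f"(inference spec: {details.get('InferenceSpecificationName')})")
```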
## Performance Metrics Reference

| Metric                | Description                                | Unit       | Stats Available                   |
| --------------------- | ------------------------------------------ | ---------- | --------------------------------- |
| TimeToFirstToken      | Time from request to first generated token | ms         | p50, p90, p95, p99, avg, min, max |
| InterTokenLatency     | Time between consecutive tokens            | ms         | p50, p90, p95, p99, avg, min, max |
| OutputTokenThroughput | Tokens generated per second                | tokens/s   | avg                               |
| RequestThroughput     | Requests completed per second              | requests/s | avg                               |
| RequestLatency        | Total time for a complete request          | ms         | p50, p90, p95, p99, avg, min, max |
| ClientSideConcurrency | Concurrency level used during benchmarking | count      | —                                 |
diff --git a/plugins/sagemaker-ai/skills/ai-optimization/references/recommendation-deploy.md b/plugins/sagemaker-ai/skills/ai-optimization/references/recommendation-deploy.md
new file mode 100644
index 0000000..9b1e44f
--- /dev/null
+++ b/plugins/sagemaker-ai/skills/ai-optimization/references/recommendation-deploy.md
@@ -0,0 +1,41 @@
# Deploy from ModelPackage

Each recommendation includes a `ModelPackageArn` for direct deployment.

Generate this notebook cell:

```python
# Deploy the top recommendation
# `resp` is the DescribeAIRecommendationJob response from the recommendation workflow
rec = resp["Recommendations"][0]
mp_arn = rec["ModelDetails"]["ModelPackageArn"]
spec_name = rec["ModelDetails"]["InferenceSpecificationName"]
instance_type = rec["DeploymentConfiguration"]["InstanceType"]

# Create model from the specific inference specification
sm.create_model(
    ModelName="my-optimized-model",
    PrimaryContainer={
        "ModelPackageName": mp_arn,
        "InferenceSpecificationName": spec_name,
    },
    ExecutionRoleArn="<execution-role-arn>",  # Role with SageMaker permissions
)

# Create endpoint config and endpoint
sm.create_endpoint_config(
    EndpointConfigName="my-optimized-epc",
    ProductionVariants=[{
        "VariantName": "AllTraffic",
        "ModelName": "my-optimized-model",
        "InstanceType": instance_type,
        "InitialInstanceCount": 1,
    }],
)

sm.create_endpoint(
    EndpointName="my-optimized-endpoint",
    EndpointConfigName="my-optimized-epc",
)
```

**Important:** Always use `InferenceSpecificationName` to select the specific recommendation's configuration. Without it, SageMaker uses the primary InferenceSpecification (a copy of the first recommendation).
diff --git a/plugins/sagemaker-ai/skills/ai-optimization/references/recommendation-options.md b/plugins/sagemaker-ai/skills/ai-optimization/references/recommendation-options.md
new file mode 100644
index 0000000..430fdf6
--- /dev/null
+++ b/plugins/sagemaker-ai/skills/ai-optimization/references/recommendation-options.md
@@ -0,0 +1,74 @@
# Recommendation Configuration Options

## Instance Types (optional)

For **latency** and **throughput** targets, the user can specify which instance types to evaluate:

> "Would you like to specify which instance types to evaluate, or let the service choose automatically?
>
> Examples: `ml.g5.xlarge`, `ml.g6.12xlarge`, `ml.p5.48xlarge`"

Note: **Cost** target does not support customer-specified instance types.

## Optimization (optional)

> "Should the service try to optimize the model?
>
> - **Kernel tuning** — Up to 30% improved TTFT and throughput
> - **Speculative decoding** — Up to 10x improved throughput
>
> **Heads up:** Without optimizations, the job takes 30–60 minutes; with optimizations, 2–10 hours.
>
> Default is yes (`OptimizeModel=True`). Set `OptimizeModel=False` to skip."
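As a hedged sketch of how the two optional choices above land on the API call (parameter names mirror the Step 4 example in `recommendation-workflow.md`; the helper variables are placeholders for the user's answers, not API fields):

```python
# Illustrative only: map the user's answers onto optional create_ai_recommendation_job arguments.
user_instance_types = ["ml.g5.xlarge", "ml.g6.12xlarge"]  # or [] to let the service choose
user_wants_optimization = True                            # default behavior (OptimizeModel=True)

optional_args = {}
if user_instance_types:
    # Skip ComputeSpec when the performance target is cost (customer-specified types not supported)
    optional_args["ComputeSpec"] = {"InstanceTypes": user_instance_types}
if not user_wants_optimization:
    optional_args["OptimizeModel"] = False

# Pass alongside the required arguments shown in recommendation-workflow.md Step 4:
# sm.create_ai_recommendation_job(..., **optional_args)
```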
### Dataset Requirement for Throughput + Optimization

If **throughput** target AND `OptimizeModel=True`, a dataset is **required** for speculative decoding. Without it, the job fails with `ValidationError`.

Ask:

> "Since you chose throughput with optimizations, I need a dataset. Format:
>
> - **ShareGPT** — JSONL with `conversations` column: `[{"from": "human", "value": "..."}, ...]`
> - **OpenAI Chat** — JSONL with `messages` column: `[{"role": "user", "content": "..."}, ...]`
> - **OpenAI Completions** — JSONL with `prompt` column (optionally `completion`/`response`)
> - **File type:** `.jsonl`
> - **Location:** S3 prefix containing dataset files
>
> What's the S3 URI? (e.g., `s3://my-bucket/datasets/prompts/`)"

⏸ Wait for user.

Provide `DatasetConfig` on the **workload config** (not the recommendation job):

```python
# config_name and workload_spec come from the workload-config step (benchmark-workflow.md Step 2)
sm.create_ai_workload_config(
    AIWorkloadConfigName=config_name,
    AIWorkloadConfigs={"WorkloadSpec": {"Inline": json.dumps(workload_spec)}},
    DatasetConfig={
        "InputDataConfig": [{
            "ChannelName": "datasets",
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": "<dataset-s3-uri>",  # S3 prefix provided by the user
                }
            },
            "ContentType": "application/jsonl",
        }],
    },
)
```

## Workload Config

Required. If one doesn't exist, create it per `benchmark-workflow.md` Step 2.

## Inference Framework (optional)

> "Which inference framework?
>
> - **VLLM** — High-performance serving with PagedAttention
> - **LMI** — SageMaker Large Model Inference container
>
> Default: auto-selected based on the model."
diff --git a/plugins/sagemaker-ai/skills/ai-optimization/references/recommendation-workflow.md b/plugins/sagemaker-ai/skills/ai-optimization/references/recommendation-workflow.md
new file mode 100644
index 0000000..9d36d2a
--- /dev/null
+++ b/plugins/sagemaker-ai/skills/ai-optimization/references/recommendation-workflow.md
@@ -0,0 +1,100 @@
# Recommendation Workflow

Guide the user through creating and running an AI Recommendation Job.

## Step 1: Gather Model Information

You need:

- **Model S3 URI** — S3 path to the model weights (HuggingFace format)
- **IAM Role ARN** — Execution role with SageMaker and S3 permissions

If not already known, ask:

> "I need two things to get started:
>
> 1. **Model location** — The S3 URI where your model weights are stored (e.g., `s3://my-bucket/models/llama-3-8b/`)
> 2. **IAM Role** — An execution role ARN with SageMaker permissions
>
> The model should be in HuggingFace format (config.json + model weights)."

⏸ Wait for user.

## Step 2: Choose a Performance Target

| Target         | Metric       | What It Optimizes                                                    |
| -------------- | ------------ | -------------------------------------------------------------------- |
| **Cost**       | `cost`       | Lowest cost per hour while meeting baseline performance              |
| **Latency**    | `ttft-ms`    | Lowest time-to-first-token (with optional stat: p50, p90, p95, p99)  |
| **Throughput** | `throughput` | Highest output tokens per second                                     |

Ask the user:

> "What's most important for your use case?
>
> 1. **Cost** — Find the cheapest instance that meets performance requirements
> 2. **Latency** — Minimize time-to-first-token (best for interactive/chat)
> 3. **Throughput** — Maximize tokens per second (best for batch processing)"

⏸ Wait for user.

## Step 3: Configure Options

For instance types, optimization, dataset, workload config, and inference framework options, read `recommendation-options.md`.
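If a concrete example of the performance-target shape helps before building the full request, here is a hedged sketch based on the metric names in the Step 2 table (the structure mirrors the Step 4 example below; the optional `Stat` key for latency is an assumption to verify against the API reference):

```python
# Hedged sketch: one PerformanceTarget per goal from the Step 2 table.
performance_targets = {
    "cost":       {"Constraints": [{"Metric": "cost"}]},
    "latency":    {"Constraints": [{"Metric": "ttft-ms", "Stat": "p90"}]},  # "Stat" key is an assumption
    "throughput": {"Constraints": [{"Metric": "throughput"}]},
}
performance_target = performance_targets["cost"]  # substitute the user's choice from Step 2
```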
## Step 4: Create the Recommendation Job

Generate a notebook that creates and monitors the job:

```python
import boto3
import json
import time

sm = boto3.client("sagemaker")

config_name = "my-rec-workload-config"  # Workload config from Step 3 (must already exist)
job_name = "my-recommendation-job"
sm.create_ai_recommendation_job(
    AIRecommendationJobName=job_name,
    ModelSource={"S3": {"S3Uri": "<model-s3-uri>"}},  # HuggingFace-format model from Step 1
    OutputConfig={"S3OutputLocation": "s3://<bucket>/rec-output/"},
    RoleArn="<role-arn>",  # Execution role from Step 1
    AIWorkloadConfigIdentifier=config_name,
    PerformanceTarget={"Constraints": [{"Metric": "cost"}]},
    # ComputeSpec={"InstanceTypes": ["ml.g6.12xlarge"]},
    # OptimizeModel=False,
    # InferenceSpecification={"Framework": "VLLM"},
)

# Poll until complete (timeout after 2 hours — optimization jobs can take longer)
MAX_WAIT = 7200
start = time.time()
while time.time() - start < MAX_WAIT:
    resp = sm.describe_ai_recommendation_job(AIRecommendationJobName=job_name)
    status = resp["AIRecommendationJobStatus"]
    print(f"Status: {status} ({int(time.time() - start)}s elapsed)")
    if status in ("Completed", "Failed", "Stopped"):
        break
    time.sleep(60)
else:
    raise TimeoutError("Recommendation job did not complete within 2 hours")

if status == "Failed":
    print(f"Failed: {resp.get('FailureReason')}")
else:
    print(f"Completed with {len(resp.get('Recommendations', []))} recommendations")
```

## Step 5: Present Recommendations

The `DescribeAIRecommendationJob` response contains:

- **Recommendations** — Ranked list with `DeploymentConfiguration`, `ExpectedPerformance`, `OptimizationDetails`, `ModelDetails`
- **OutputConfig** — S3 location and `ModelPackageGroupIdentifier`

Return to the main SKILL.md Step 3 (Review Results).

## Step 6: Deploy from ModelPackage

Read `recommendation-deploy.md` for deployment code.
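As a supplement to Step 5, a hedged sketch of flattening the response into the kind of ranked summary described in `interpreting-results.md` (top-level field names follow the response structure listed above; treat any nested keys as assumptions to check against the actual response):

```python
# Supplementary sketch for Step 5: summarize each recommendation for the user.
for rank, rec in enumerate(resp.get("Recommendations", []), start=1):
    deploy = rec.get("DeploymentConfiguration", {})
    print(f"#{rank}: instance={deploy.get('InstanceType')}")
    print(f"    expected performance: {rec.get('ExpectedPerformance')}")
    print(f"    optimizations: {rec.get('OptimizationDetails')}")
```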