diff --git a/GenAI/CloudAI-Playground/vLLM_QAIC_Concurrency_Benchmark_on_AIC100/README.md b/GenAI/CloudAI-Playground/vLLM_QAIC_Concurrency_Benchmark_on_AIC100/README.md new file mode 100644 index 0000000..b4df2db --- /dev/null +++ b/GenAI/CloudAI-Playground/vLLM_QAIC_Concurrency_Benchmark_on_AIC100/README.md @@ -0,0 +1,428 @@ +# [Startup_Demo](../../../)/[GenAI](../../)/[Cloud AI Playground](../)/[vLLM QAIC Concurrency Benchmark on AIC100](./) +# vLLM QAIC Concurrency Benchmark on AIC100 + +## 📘Table of Contents +- [🧭Overview](#1overview) +- [✨Features](#2features) +- [🐳Requirements](#3requirements) +- [⚙️Environment Setup](#4️environment-setup) +- [📊Benchmark Methodology](#5benchmark-methodology) +- [📦Benchmark Outputs](#6benchmark-outputs) +- [🚀Demo](#7demo) +- [✅Summary](#summary) +- [🔧Customization](#customization) +--- +## 1.🧭Overview + +This project benchmarks Large Language Model (LLM) serving performance on Qualcomm AIC100 using vLLM. + +The benchmark evaluates: +- System-level throughput +- Concurrency scaling behavior +- Decode workload efficiency +- System limits under heavy load + +The purpose is to provide a **comprehensive evaluation of QAIC hardware performance for LLM inference workloads**, including both optimal operating regions and failure conditions. 
+ +```mermaid +flowchart LR + Client["Benchmark Script (host machine)"] + + subgraph Servers["vLLM Endpoints (4x)"] + S1[":8000\nDevice 0"] + S2[":8001\nDevice 1"] + S3[":8002\nDevice 2"] + S4[":8003\nDevice 3"] + end + + D0["QAIC accel0"] + D1["QAIC accel1"] + D2["QAIC accel2"] + D3["QAIC accel3"] + + Client -->|Concurrent Requests| S1 + Client -->|Concurrent Requests| S2 + Client -->|Concurrent Requests| S3 + Client -->|Concurrent Requests| S4 + + S1 --> D0 + S2 --> D1 + S3 --> D2 + S4 --> D3 + + D0 -->|Tokens| S1 + D1 -->|Tokens| S2 + D2 -->|Tokens| S3 + D3 -->|Tokens| S4 + + S1 -->|Response| Client + S2 -->|Response| Client + S3 -->|Response| Client + S4 -->|Response| Client +``` +Each vLLM endpoint is pinned to a dedicated QAIC device, enabling independent inference pipelines and predictable scaling behavior. + + +--- +## 2.✨Features + +This benchmark provides the following analysis capabilities: + +### 🔹 System Throughput Characterization + +Measures total system throughput across multiple vLLM endpoints: +- Requests per second (TPS) +- Tokens per second (tokens/s) + +Allows identification of: +- Peak performance +- Throughput saturation point + +### 🔹 Concurrency Scaling Analysis +Evaluates system behavior as concurrency increases: +> concurrency = 1 → 64 + +Captures: +- Linear scaling region +- Saturation region +- Performance degradation + +### 🔹 Decode Workload Sensitivity +Analyzes performance under different generation lengths: +> max_tokens = [32, 128, 512] + +Reveals: +- Overhead-bound vs compute-bound regimes +- Efficiency trade-offs across workloads + +### 🔹 Multi-Endpoint Load Distribution +Distributes load across 4 vLLM endpoints: +> worker_id % 4 + +Verifies: +- Balanced hardware utilization +- Absence of bottleneck devices + +### 🔹 Failure and Stability Detection +Identifies system limits under stress and failure boundaries: +- Request failures +- Timeout conditions +- Throughput collapse + +--- +## 3.🐳Requirements + +This section describes both 
the hardware platform (QAIC) and the software environment used for benchmarking. + + +### 3.1 AIC100 Hardware Platform (QAIC) + +The Qualcomm AI Cloud Inference (QAIC) AIC100 is a dedicated AI inference accelerator designed for: + +✅ Large-scale AI serving +✅ Energy-efficient inference +✅ On-prem deployment + +#### 🔸 Architectural Characteristics + +| Feature | Description | +|---------|-------------| +| Compute type | Dedicated inference accelerator | +| Execution model | Optimized for continuous inference | +| Parallelism | Multi-device scaling | +| Memory model | KV cache intensive | +| Workload target | Sustained decode workloads | + +#### 🔸 Hardware Specification + +| Component | Description | +|----------|------------| +| Accelerator | Qualcomm Cloud AI 100 Ultra (AIC100 Ultra) | +| Device Interface | PCIe-based accelerator | +| Device Nodes | `/dev/accel/accel*` | +| Deployment Mode | Multi-device (4 cards in this benchmark) | +| Target Workload | LLM inference (Generative AI workloads) | + +Reference: [Cloud AI 100 Ultra Overview](https://www.qualcomm.com/artificial-intelligence/data-center/cloud-ai-100-ultra#Overview) + +### 3.2 Software Environment + +- Ubuntu Linux host +- Python3 +- Docker +- Qualcomm Cloud AI SDK (Platform & Apps) v1.21.4 +- vLLM (OpenAI-compatible serving) + +> ✅ Before you begin following this guide, you need to pre‑install the [Qualcomm Cloud AI SDK](https://quic.github.io/cloud-ai-sdk-pages/1.21/Getting-Started/Installation/sdk-installation.html) on the AIC100. + +--- +## 4.⚙️Environment Setup + +### 4.1 Download the Docker Image + +To enable multi-model server endpoints, you need to install the Docker image from the [Cloud AI Containers](https://github.com/quic/cloud-ai-containers/pkgs/container/cloud_ai_inference_ubuntu22). 
+ +Use the following command to download the image: +```bash +docker pull ghcr.io/quic/cloud_ai_inference_ubuntu22:1.21.4.0 +``` + +Verify that the image was downloaded successfully: +```bash +docker images +``` +If successful, the repository should appear in the list as shown below: +![N|Solid](./images/docker-images-show.png) + +💡*This sample uses the Docker image version cloud_ai_inference_ubuntu22:1.21.4.0.* + +### 4.2 Verify Available QAIC Devices + +Before creating containers, verify the available QAIC devices using: +```bash +sudo /opt/qti-aic/tools/qaic-util -t 1 +``` +![N|Solid](./images/qaic-devices-status.png) +💡*To reproduce this multi-endpoint server setup, four QAIC devices are required.* + +### 4.3 Create Containers for Server Endpoints + +In this sample, four instances of the same model are served, so four containers are required, one per server endpoint. + +```bash +docker run -dit --name vllm-aic100-s1-dg0 --device=/dev/accel/accel0 -v /home/qitc/:/home/qitc/ -p 8000:8000 ghcr.io/quic/cloud_ai_inference_ubuntu22:1.21.4.0 + +docker run -dit --name vllm-aic100-s2-dg1 --device=/dev/accel/accel1 -v /home/qitc/:/home/qitc/ -p 8001:8000 ghcr.io/quic/cloud_ai_inference_ubuntu22:1.21.4.0 + +docker run -dit --name vllm-aic100-s3-dg2 --device=/dev/accel/accel2 -v /home/qitc/:/home/qitc/ -p 8002:8000 ghcr.io/quic/cloud_ai_inference_ubuntu22:1.21.4.0 + +docker run -dit --name vllm-aic100-s4-dg3 --device=/dev/accel/accel3 -v /home/qitc/:/home/qitc/ -p 8003:8000 ghcr.io/quic/cloud_ai_inference_ubuntu22:1.21.4.0 +``` +Each container maps one QAIC device and one host port (8000 to 8003) to an independent vLLM endpoint listening on container port 8000. + +This configuration ensures that each vLLM instance is pinned to a dedicated QAIC device, enabling isolated and predictable performance measurement. + +### 4.4 Launching the Server + +The containers ship with all required dependencies pre-installed. +Only the model download and server startup remain. 
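The four `docker run` commands in section 4.3 differ only in the container name, the device node, and the host port. As a rough sketch of that pattern, they could be generated rather than typed by hand. The helper name `docker_run_command` is hypothetical and not part of this repository; the image tag, volume mount, and naming scheme are copied from the commands above.

```python
IMAGE = "ghcr.io/quic/cloud_ai_inference_ubuntu22:1.21.4.0"
VOLUME = "/home/qitc/:/home/qitc/"  # host path used in this README; adjust for your machine


def docker_run_command(index: int) -> str:
    """Build the docker run command for the endpoint with zero-based index."""
    name = f"vllm-aic100-s{index + 1}-dg{index}"  # e.g. vllm-aic100-s1-dg0
    device = f"/dev/accel/accel{index}"           # one QAIC device per container
    host_port = 8000 + index                      # host ports 8000..8003
    return (
        f"docker run -dit --name {name} --device={device} "
        f"-v {VOLUME} -p {host_port}:8000 {IMAGE}"
    )


if __name__ == "__main__":
    # Print the commands instead of executing them, so they can be reviewed first.
    for i in range(4):
        print(docker_run_command(i))
```

Printing the commands keeps the sketch side-effect free; they can be pasted into a shell once the device indices have been checked against the `qaic-util` output.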
+ +### 🔹 Hugging Face Setup + +Enter each container (use a separate terminal per container): +```bash +docker exec -it vllm-aic100-s1-dg0 /bin/bash + +docker exec -it vllm-aic100-s2-dg1 /bin/bash + +docker exec -it vllm-aic100-s3-dg2 /bin/bash + +docker exec -it vllm-aic100-s4-dg3 /bin/bash +``` + +Activate the pre-configured virtual environment (in each container): +```bash +source /opt/vllm-env/bin/activate +``` + +Log in to Hugging Face (required for model access): +```bash +huggingface-cli login +``` +![N|Solid](./images/huggingface_cli_login.png) + +💡*Note: `huggingface-cli login` asks for a Hugging Face access token. Llama 3.2 is a gated model, so if your account does not yet have access, sign in and request it on the model page: [meta-llama/Llama-3.2-3B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct).* + +### 🔹 Start the vLLM Server + +Run the following command in each container to launch the server: +```bash +python3 -m vllm.entrypoints.openai.api_server \ +--host 0.0.0.0 \ +--port 8000 \ +--device-group 0 \ +--model meta-llama/Llama-3.2-3B-Instruct \ +--max-model-len 1024 \ +--block-size 16 \ +--quantization mxfp6 \ +--kv-cache-dtype auto \ +--disable-sliding-window \ +--max-num-seqs 32 +``` +If the server starts successfully, you should see logs similar to: +![N|Solid](./images/server-start-log.png) + +--- +## 5.📊Benchmark Methodology + +### 🔹 Test Configuration + +| Parameter | Value | +|-----------|-------| +| Model | meta-llama/Llama-3.2-3B-Instruct | +| Prompt | "Explain what AI is in simple terms." | +| max_tokens | 32 / 128 / 512 | +| Concurrency | 1 → 64 | +| Duration | 20 sec | +| Warmup | 3 sec | +| Repeats | 3 | + +### 🔹 Metrics + +✅ Throughput +> TPS = number of successful requests completed per second +- Calculated over the measurement window (excluding warmup) +- Only successful responses are counted + +✅ Tokens per Second (tokens/s) +> tokens/s = total completion tokens generated per second + +In this benchmark: +- Most requests generate exactly `max_tokens` +- Therefore, tokens/s ≈ TPS × max_tokens + +As a result, tokens/s is approximately proportional to TPS for a fixed `max_tokens`. 
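The relation tokens/s ≈ TPS × max_tokens can be checked against the published figures. The following is a minimal sketch, assuming every request emits exactly `max_tokens` completion tokens. The function name `expected_tokens_per_s` is illustrative and does not appear in the benchmark script; the sample values are the concurrency 8 and concurrency 64 rows of the max_tokens = 128 results table in section 7.

```python
def expected_tokens_per_s(tps: float, max_tokens: int) -> float:
    """Estimate tokens/s, assuming each request generates exactly max_tokens tokens."""
    return tps * max_tokens


# (concurrency, TPS in req/s, reported tokens/s) taken from the section 7 table
rows = [(8, 1.20, 153.60), (64, 6.40, 819.20)]

for concurrency, tps, reported in rows:
    estimate = expected_tokens_per_s(tps, 128)
    print(f"concurrency={concurrency}: estimated {estimate:.2f}, reported {reported:.2f}")
```

If a model stops early on some requests (for example at an end-of-sequence token), the reported tokens/s would fall below this estimate, which is why the script sums `completion_tokens` from each response instead of multiplying.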
+ +✅ Latency +- Average latency: mean end-to-end time per successful request +- Maximum latency: worst-case request latency observed + +Includes: +- Network overhead +- Scheduling delay +- Model inference time + +✅ Per-endpoint Metrics +- TPS per endpoint (`ep0 ~ ep3`) +- Tokens/sec per endpoint + +Used to evaluate: +- Load distribution across QAIC devices +- System balance and hardware utilization + +✅ Failure Rate +- Counts failed requests per endpoint and globally +- Includes: + - Timeout failures + - Unsuccessful responses + +A high failure rate typically indicates: +- System saturation +- Excessive queueing delay +- Requests exceeding timeout limits under high concurrency + + +💡*Note: In this benchmark, tokens/sec scales approximately linearly with TPS under fixed max_tokens settings, indicating a stable and deterministic workload.* + +--- +## 6.📦Benchmark Outputs + +This benchmark generates structured outputs for performance analysis and visualization. + +These include raw metrics, throughput plots, latency plots, and multi-workload comparisons. + + +### 🔹 Result Files + +| Category | Files | Purpose | +|---------|------|--------| +| Raw Metrics | `multi_results_tok*.csv` | Stores aggregated throughput, latency, and failure metrics | +| Throughput Plots | `multi_tps_tok*.png` | Visualizes TPS scaling behavior | +| Latency Plots | `multi_latency_tok*.png` | Visualizes latency trends across concurrency | +| Per-endpoint Metrics | `per_endpoint_tps_tok*.png` | Shows load distribution across endpoints | +| Comparative Analysis | `compare_*multi_tokens*.png` | Compares different workload behaviors | + +All outputs are stored in: +``` +vLLM_QAIC_Concurrency_Benchmark_on_AIC100/results/ +``` + +--- +## 7.🚀Demo + +This section demonstrates the performance of QAIC under a representative workload: + +```code +max_tokens = 128 (balanced workload) +``` + +This setting provides the best trade-off between throughput, latency, and system stability. 
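The `multi_results_tok*.csv` files described in section 6 can be inspected directly after a run, without re-plotting. The sketch below assumes the column names written by the benchmark script (`concurrency`, `tok_per_s_mean`) and treats empty cells as missing values. The helper name `peak_row` is hypothetical, and the inline sample stands in for a real results file, using two rows from the section 7 table.

```python
import csv
import io


def peak_row(csv_file):
    """Return the row with the highest tok_per_s_mean, or None if no row has a value."""
    best = None
    for row in csv.DictReader(csv_file):
        value = row.get("tok_per_s_mean", "")
        if not value:  # csv.DictWriter writes None as an empty cell
            continue
        if best is None or float(value) > float(best["tok_per_s_mean"]):
            best = row
    return best


# Stand-in for a real results file; values copied from the section 7 table
sample = io.StringIO(
    "concurrency,tok_per_s_mean\n"
    "8,153.6\n"
    "64,819.2\n"
)
best = peak_row(sample)
print(best["concurrency"], best["tok_per_s_mean"])
```

With a real file, pass `open(path, newline="", encoding="utf-8")` in place of the `StringIO` object.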
+ +### 🔹 Run the Benchmark + + +Execute the following command on the host machine (outside the containers, e.g., on the AIC100 server). The script imports `aiohttp` and `matplotlib`, so both must be installed on the host. +```bash +python3 ./benchmark_multi_server_full_multitokens.py +``` +💡*Note: This benchmark script runs outside the containers and sends requests to the vLLM endpoints exposed by each container on ports 8000 to 8003.* +### 🔹 Key Results (max_tokens=128) + +| Concurrency | TPS (req/s) | Tokens/s | Avg Latency (s) | +|------------|------------|----------|----------------| +| 8 | 1.20 | 153.60 | 9.31 | +| 16 | 2.40 | 307.20 | 10.86 | +| 24 | 2.40 | 307.20 | 12.30 | +| 32 | 3.20 | 409.60 | 13.71 | +| 40 | 4.00 | 512.00 | 16.60 | +| 48 | 4.80 | 614.40 | 17.93 | +| 56 | 5.60 | 716.80 | 19.29 | +| 64 | 6.40 | 819.20✅ | 20.61 | + + +### 🔹 Throughput Scaling + +Aggregate throughput rises from 1.20 req/s at concurrency 8 to 6.40 req/s at concurrency 64, with a plateau between concurrency 16 and 24 and no throughput collapse within the tested range. + +Average latency grows alongside it, from 9.31 s to 20.61 s, indicating a shift toward a latency-bound regime at higher concurrency levels. + +![TPS vs Concurrency](images/multi_tps_tok128.png) + +### 🔹 Per-Endpoint Load Distribution + +![Per-endpoint TPS](images/per_endpoint_tps_tok128.png) + +Each endpoint shows nearly identical throughput across all tested concurrency levels, indicating: + +- ✅ Balanced load distribution across QAIC devices +- ✅ No single-device bottleneck +- ✅ Stable multi-device scaling behavior + +This indicates that the benchmarking setup spreads load evenly across all four QAIC devices. + + +--- + +## ✅Summary + +This benchmark demonstrates the performance characteristics of QAIC under different concurrency levels and decode workloads. 
+ +Key findings include: + +- ✅ Throughput increases steadily across the entire concurrency range (1 → 64) +- ✅ Peak performance of ~819 tokens/sec under `max_tokens = 128` +- ✅ No obvious throughput collapse observed within the tested range +- ✅ Latency increases gradually with concurrency +- ✅ Stable and balanced utilization across all QAIC devices + + +💡 *Note: This benchmark was performed using `meta-llama/Llama-3.2-3B-Instruct`. +Results may vary if different models (e.g., Qwen) are used due to architectural and decoding differences.* + +--- + +## 🔧Customization + +This benchmark can be easily adapted to evaluate different models or configurations. + +To run tests with your own model, modify the following parameters in the script (and launch the vLLM servers with the same `--model` value as `MODEL_NAME`): + +- `MODEL_NAME` +- `MAX_TOKENS_LIST` +- `CONCURRENCY_LEVELS` + +For this example, the benchmark was performed using: +``` +meta-llama/Llama-3.2-3B-Instruct +``` +Users can replace the model with any supported LLM (e.g., other Hugging Face models) to evaluate performance under different workloads. \ No newline at end of file diff --git a/GenAI/CloudAI-Playground/vLLM_QAIC_Concurrency_Benchmark_on_AIC100/benchmark_multi_server_full_multitokens.py b/GenAI/CloudAI-Playground/vLLM_QAIC_Concurrency_Benchmark_on_AIC100/benchmark_multi_server_full_multitokens.py new file mode 100644 index 0000000..7120801 --- /dev/null +++ b/GenAI/CloudAI-Playground/vLLM_QAIC_Concurrency_Benchmark_on_AIC100/benchmark_multi_server_full_multitokens.py @@ -0,0 +1,405 @@ +#===-- benchmark_multi_server_full_multitokens.py ------------------------===// +# Part of the Startup-Demos Project, under the MIT License +# See https://github.com/qualcomm/Startup-Demos/blob/main/LICENSE.txt +# for license information. +# Copyright (c) Qualcomm Technologies, Inc. and/or its subsidiaries. 
+# SPDX-License-Identifier: MIT +#===----------------------------------------------------------------------===// + +import asyncio +import aiohttp +import time +import csv +import os +from datetime import datetime +import matplotlib.pyplot as plt + +# ========================= +# Config +# ========================= +ENDPOINTS = [ + "http://localhost:8000/v1/chat/completions", + "http://localhost:8001/v1/chat/completions", + "http://localhost:8002/v1/chat/completions", + "http://localhost:8003/v1/chat/completions", +] + +MODEL_NAME = "meta-llama/Llama-3.2-3B-Instruct" +PROMPT = "Explain what AI is in simple terms." +TEMPERATURE = 0.7 + +# Sweep decode workload +MAX_TOKENS_LIST = [32, 128, 512] + +# Sweep concurrency +CONCURRENCY_LEVELS = [1, 4, 8, 12, 16, 24, 32, 40, 48, 56, 64] + +# Measurement windows +DURATION_S = 20 +WARMUP_S = 3 +REPEATS_PER_LEVEL = 3 +REQUEST_TIMEOUT_S = 60 + +# Metrics +ENABLE_TOKEN_METRICS = True + +# Output +RUN_TAG = datetime.now().strftime("%Y%m%d_%H%M%S") +OUT_DIR = "results" + + +# ========================= +# Payload builder +# ========================= +def build_payload(max_tokens: int): + return { + "model": MODEL_NAME, + "messages": [{"role": "user", "content": PROMPT}], + "max_tokens": max_tokens, + "temperature": TEMPERATURE, + } + + +# ========================= +# One request (E2E latency) +# ========================= +async def send_one(session, url, payload): + start = time.perf_counter() + try: + async with session.post(url, json=payload, timeout=aiohttp.ClientTimeout(total=REQUEST_TIMEOUT_S)) as resp: + if resp.status < 200 or resp.status >= 300: + return False, time.perf_counter() - start, None  # non-2xx: fail before parsing a possibly non-JSON body + + data = await resp.json() + elapsed = time.perf_counter() - start  # end-to-end time including body download + + completion_tokens = None + if ENABLE_TOKEN_METRICS: + usage = data.get("usage", {}) + if isinstance(usage, dict): + completion_tokens = usage.get("completion_tokens", None) + + return True, elapsed, completion_tokens + + except Exception:  # timeouts, connection errors, and malformed JSON count as failures + return False, None, None + + +# ========================= +# 
Worker: fixed endpoint pinning (worker_id % N) +# ========================= +async def worker(worker_id, session, end_t, warmup_end_t, per_worker_stats, payload): + n = len(ENDPOINTS) + url = ENDPOINTS[worker_id % n] + ep_idx = worker_id % n + + local_success = 0 + local_fail = 0 + local_lat = [] + local_tokens = 0 + + while True: + now = time.perf_counter() + if now >= end_t: + break + + ok, latency, comp_tokens = await send_one(session, url, payload) + done_t = time.perf_counter() + + # Use completion time for warmup filtering + if done_t < warmup_end_t: + continue + + if ok: + local_success += 1 + if latency is not None: + local_lat.append(latency) + if comp_tokens is not None: + local_tokens += comp_tokens + else: + local_fail += 1 + + per_worker_stats.append((ep_idx, local_success, local_fail, local_lat, local_tokens)) + + +# ========================= +# Run one level (one concurrency, one max_tokens) for fixed duration +# ========================= +async def run_level(concurrency, max_tokens): + payload = build_payload(max_tokens) + + timeout = aiohttp.ClientTimeout(total=REQUEST_TIMEOUT_S + 5) + async with aiohttp.ClientSession(timeout=timeout) as session: + t0 = time.perf_counter() + warmup_end = t0 + WARMUP_S + end_t = t0 + WARMUP_S + DURATION_S + + per_worker_stats = [] + tasks = [ + worker(i, session, end_t, warmup_end, per_worker_stats, payload) + for i in range(concurrency) + ] + await asyncio.gather(*tasks) + + # Aggregate per-endpoint + n = len(ENDPOINTS) + ep_success = [0] * n + ep_fail = [0] * n + ep_latencies = [[] for _ in range(n)] + ep_tokens = [0] * n + + for ep_idx, s, f, lats, toks in per_worker_stats: + ep_success[ep_idx] += s + ep_fail[ep_idx] += f + ep_latencies[ep_idx].extend(lats) + ep_tokens[ep_idx] += toks + + total_success = sum(ep_success) + total_fail = sum(ep_fail) + all_lat = [x for sub in ep_latencies for x in sub] + + # Throughput over measured window (exclude warmup) + tps = total_success / DURATION_S if DURATION_S > 0 else 
0.0 + tok_per_s = (sum(ep_tokens) / DURATION_S) if (ENABLE_TOKEN_METRICS and DURATION_S > 0) else None + + avg_lat = (sum(all_lat) / len(all_lat)) if all_lat else None + max_lat = max(all_lat) if all_lat else None + + ep_tps = [s / DURATION_S for s in ep_success] + ep_tokps = [(t / DURATION_S) for t in ep_tokens] if ENABLE_TOKEN_METRICS else [None] * n + ep_avg_lat = [(sum(l)/len(l) if l else None) for l in ep_latencies] + + return { + "concurrency": concurrency, + "max_tokens": max_tokens, + + "total_success": total_success, + "total_fail": total_fail, + + "tps_req_per_s": tps, + "tok_per_s": tok_per_s, + + "avg_latency_s": avg_lat, + "max_latency_s": max_lat, + + "ep_success": ep_success, + "ep_fail": ep_fail, + "ep_tps": ep_tps, + "ep_tokps": ep_tokps, + "ep_avg_lat": ep_avg_lat, + } + + +# ========================= +# Repeat & aggregate +# ========================= +def mean(vals): + vals = [v for v in vals if v is not None] + return sum(vals) / len(vals) if vals else None + + +async def run_repeats(concurrency, repeats, max_tokens): + runs = [] + for _ in range(repeats): + runs.append(await run_level(concurrency, max_tokens)) + return runs + + +def aggregate_runs(runs): + n = len(ENDPOINTS) + return { + "concurrency": runs[0]["concurrency"], + "max_tokens": runs[0]["max_tokens"], + + "tps_req_per_s_mean": mean([r["tps_req_per_s"] for r in runs]), + "tok_per_s_mean": mean([r["tok_per_s"] for r in runs]), + "avg_latency_s_mean": mean([r["avg_latency_s"] for r in runs]), + "max_latency_s_mean": mean([r["max_latency_s"] for r in runs]), + + **{f"ep{i}_tps_mean": mean([r["ep_tps"][i] for r in runs]) for i in range(n)}, + **{f"ep{i}_tokps_mean": mean([r["ep_tokps"][i] for r in runs]) for i in range(n)}, + **{f"ep{i}_avg_lat_mean": mean([r["ep_avg_lat"][i] for r in runs]) for i in range(n)}, + **{f"ep{i}_fail_mean": mean([r["ep_fail"][i] for r in runs]) for i in range(n)}, + + "total_fail_mean": mean([r["total_fail"] for r in runs]), + "total_success_mean": 
mean([r["total_success"] for r in runs]), + } + + +# ========================= +# IO & plotting +# ========================= +def safe_format(x, digits=2): + return f"{x:.{digits}f}" if x is not None else "N/A" + + +def save_csv(rows, path): + os.makedirs(os.path.dirname(path), exist_ok=True) + with open(path, "w", newline="", encoding="utf-8") as f: + w = csv.DictWriter(f, fieldnames=list(rows[0].keys())) + w.writeheader() + w.writerows(rows) + + +def plot_overall_tps(rows, out_png, title_suffix=""): + conc = [r["concurrency"] for r in rows] + tps = [r["tps_req_per_s_mean"] for r in rows] + plt.figure() + plt.plot(conc, tps, marker="o") + plt.xlabel("Concurrency (total across 4 servers)") + plt.ylabel("TPS (requests/sec)") + plt.title(f"Aggregate TPS vs Concurrency (4 servers) - {MODEL_NAME}{title_suffix}") + plt.grid(True, linestyle="--", alpha=0.6) + os.makedirs(os.path.dirname(out_png), exist_ok=True) + plt.tight_layout(rect=[0, 0, 1, 0.95]) + plt.savefig(out_png, dpi=150, bbox_inches="tight", pad_inches=0.2) + plt.close() + + +def plot_latency(rows, out_png, title_suffix=""): + conc = [r["concurrency"] for r in rows] + avg_lat = [r["avg_latency_s_mean"] for r in rows] + max_lat = [r["max_latency_s_mean"] for r in rows] + plt.figure() + plt.plot(conc, avg_lat, marker="o", label="Avg latency") + plt.plot(conc, max_lat, marker="x", label="Max latency") + plt.xlabel("Concurrency (total across 4 servers)") + plt.ylabel("Latency (sec)") + plt.title(f"Latency vs Concurrency (4 servers) - {MODEL_NAME}{title_suffix}") + plt.legend() + plt.grid(True, linestyle="--", alpha=0.6) + os.makedirs(os.path.dirname(out_png), exist_ok=True) + plt.tight_layout(rect=[0, 0, 1, 0.95]) + plt.savefig(out_png, dpi=150, bbox_inches="tight", pad_inches=0.2) + plt.close() + + +def plot_per_endpoint_tps(rows, out_png, title_suffix="", jitter_x=True): + """ + jitter_x=True will shift x slightly per endpoint so overlapping lines are visible. 
+ """ + conc = [r["concurrency"] for r in rows] + plt.figure() + + offsets = [-0.30, -0.10, 0.10, 0.30] + styles = ["-", "--", "-.", ":"] + + for i in range(len(ENDPOINTS)): + tps_i = [r[f"ep{i}_tps_mean"] for r in rows] + x_i = [(c + offsets[i]) for c in conc] if jitter_x else conc + plt.plot(x_i, tps_i, marker="o", linestyle=styles[i], linewidth=2, label=f"endpoint {i}") + + plt.xlabel("Concurrency (total across 4 servers)") + plt.ylabel("TPS per endpoint (req/s)") + plt.title(f"Per-endpoint TPS vs Concurrency - {MODEL_NAME}{title_suffix}") + plt.legend() + plt.grid(True, linestyle="--", alpha=0.6) + os.makedirs(os.path.dirname(out_png), exist_ok=True) + plt.tight_layout(rect=[0, 0, 1, 0.95]) + plt.savefig(out_png, dpi=150, bbox_inches="tight", pad_inches=0.2) + plt.close() + + +def plot_multi_tokens_curves(all_rows_by_tokens, out_png, y_key, y_label, title): + """ + Draw multiple lines (one per max_tokens) on the same chart. + y_key: e.g. "tps_req_per_s_mean" or "tok_per_s_mean" + """ + plt.figure() + for max_tokens, rows in all_rows_by_tokens.items(): + conc = [r["concurrency"] for r in rows] + y = [r[y_key] for r in rows] + plt.plot(conc, y, marker="o", label=f"max_tokens={max_tokens}") + + plt.xlabel("Concurrency (total across 4 servers)") + plt.ylabel(y_label) + plt.title(title) + plt.legend() + plt.grid(True, linestyle="--", alpha=0.6) + os.makedirs(os.path.dirname(out_png), exist_ok=True) + plt.tight_layout(rect=[0, 0, 1, 0.95]) + plt.savefig(out_png, dpi=150, bbox_inches="tight", pad_inches=0.2) + plt.close() + + +# ========================= +# Main +# ========================= +async def main(): + print("Endpoints:") + for e in ENDPOINTS: + print(" ", e) + print(f"\nModel={MODEL_NAME}, duration={DURATION_S}s, warmup={WARMUP_S}s, repeats={REPEATS_PER_LEVEL}") + print(f"MAX_TOKENS_LIST={MAX_TOKENS_LIST}\n") + + os.makedirs(OUT_DIR, exist_ok=True) + + # Collect results for final multi-curve plots + all_rows_by_tokens = {} + + for max_tokens in 
MAX_TOKENS_LIST: + print("\n==============================") + print(f" MAX_TOKENS = {max_tokens}") + print("==============================\n") + + agg_rows = [] + + for c in CONCURRENCY_LEVELS: + runs = await run_repeats(c, REPEATS_PER_LEVEL, max_tokens) + agg = aggregate_runs(runs) + agg_rows.append(agg) + + print(f"=== Concurrency {c} ===") + print(f"Aggregate TPS(mean): {safe_format(agg['tps_req_per_s_mean'])} req/s") + print(f"Tokens/s(mean): {safe_format(agg['tok_per_s_mean'])}") + print(f"Avg latency(mean): {safe_format(agg['avg_latency_s_mean'])} s") + print(f"Max latency(mean): {safe_format(agg['max_latency_s_mean'])} s") + print(f"Total fail(mean): {safe_format(agg['total_fail_mean'])}") + print("") + + all_rows_by_tokens[max_tokens] = agg_rows + + # Per max_tokens outputs + csv_path = f"{OUT_DIR}/multi_results_tok{max_tokens}_{RUN_TAG}.csv" + tps_png = f"{OUT_DIR}/multi_tps_tok{max_tokens}_{RUN_TAG}.png" + lat_png = f"{OUT_DIR}/multi_latency_tok{max_tokens}_{RUN_TAG}.png" + per_png = f"{OUT_DIR}/per_endpoint_tps_tok{max_tokens}_{RUN_TAG}.png" + + save_csv(agg_rows, csv_path) + plot_overall_tps(agg_rows, tps_png, title_suffix=f" (max_tokens={max_tokens})") + plot_latency(agg_rows, lat_png, title_suffix=f" (max_tokens={max_tokens})") + plot_per_endpoint_tps(agg_rows, per_png, title_suffix=f" (max_tokens={max_tokens})", jitter_x=True) + + print("Saved:") + print(" CSV :", csv_path) + print(" TPS :", tps_png) + print(" LAT :", lat_png) + print(" PER :", per_png) + + # Final multi-curve summary plots (compare max_tokens) + multi_tps_png = f"{OUT_DIR}/compare_tps_multi_tokens_{RUN_TAG}.png" + multi_tokps_png = f"{OUT_DIR}/compare_tokps_multi_tokens_{RUN_TAG}.png" + + plot_multi_tokens_curves( + all_rows_by_tokens, + multi_tps_png, + y_key="tps_req_per_s_mean", + y_label="TPS (requests/sec)", + title=f"Aggregate TPS vs Concurrency (compare max_tokens) - {MODEL_NAME}" + ) + + plot_multi_tokens_curves( + all_rows_by_tokens, + multi_tokps_png, + 
y_key="tok_per_s_mean", + y_label="Tokens/sec", + title=f"Tokens/sec vs Concurrency (compare max_tokens) - {MODEL_NAME}" + ) + + print("\nSaved multi-curve comparison plots:") + print(" TPS compare :", multi_tps_png) + print(" Tok/s compare:", multi_tokps_png) + + +if __name__ == "__main__": + asyncio.run(main()) \ No newline at end of file diff --git a/GenAI/CloudAI-Playground/vLLM_QAIC_Concurrency_Benchmark_on_AIC100/images/docker-images-show.png b/GenAI/CloudAI-Playground/vLLM_QAIC_Concurrency_Benchmark_on_AIC100/images/docker-images-show.png new file mode 100644 index 0000000..d9c285c Binary files /dev/null and b/GenAI/CloudAI-Playground/vLLM_QAIC_Concurrency_Benchmark_on_AIC100/images/docker-images-show.png differ diff --git a/GenAI/CloudAI-Playground/vLLM_QAIC_Concurrency_Benchmark_on_AIC100/images/huggingface_cli_login.png b/GenAI/CloudAI-Playground/vLLM_QAIC_Concurrency_Benchmark_on_AIC100/images/huggingface_cli_login.png new file mode 100644 index 0000000..5ec0507 Binary files /dev/null and b/GenAI/CloudAI-Playground/vLLM_QAIC_Concurrency_Benchmark_on_AIC100/images/huggingface_cli_login.png differ diff --git a/GenAI/CloudAI-Playground/vLLM_QAIC_Concurrency_Benchmark_on_AIC100/images/multi_tps_tok128.png b/GenAI/CloudAI-Playground/vLLM_QAIC_Concurrency_Benchmark_on_AIC100/images/multi_tps_tok128.png new file mode 100644 index 0000000..587a424 Binary files /dev/null and b/GenAI/CloudAI-Playground/vLLM_QAIC_Concurrency_Benchmark_on_AIC100/images/multi_tps_tok128.png differ diff --git a/GenAI/CloudAI-Playground/vLLM_QAIC_Concurrency_Benchmark_on_AIC100/images/per_endpoint_tps_tok128.png b/GenAI/CloudAI-Playground/vLLM_QAIC_Concurrency_Benchmark_on_AIC100/images/per_endpoint_tps_tok128.png new file mode 100644 index 0000000..c2f6c36 Binary files /dev/null and b/GenAI/CloudAI-Playground/vLLM_QAIC_Concurrency_Benchmark_on_AIC100/images/per_endpoint_tps_tok128.png differ diff --git 
a/GenAI/CloudAI-Playground/vLLM_QAIC_Concurrency_Benchmark_on_AIC100/images/qaic-devices-status.png b/GenAI/CloudAI-Playground/vLLM_QAIC_Concurrency_Benchmark_on_AIC100/images/qaic-devices-status.png new file mode 100644 index 0000000..1e4bc7d Binary files /dev/null and b/GenAI/CloudAI-Playground/vLLM_QAIC_Concurrency_Benchmark_on_AIC100/images/qaic-devices-status.png differ diff --git a/GenAI/CloudAI-Playground/vLLM_QAIC_Concurrency_Benchmark_on_AIC100/images/server-start-log.png b/GenAI/CloudAI-Playground/vLLM_QAIC_Concurrency_Benchmark_on_AIC100/images/server-start-log.png new file mode 100644 index 0000000..a02beb5 Binary files /dev/null and b/GenAI/CloudAI-Playground/vLLM_QAIC_Concurrency_Benchmark_on_AIC100/images/server-start-log.png differ diff --git a/GenAI/CloudAI-Playground/vLLM_QAIC_Concurrency_Benchmark_on_AIC100/project.yaml b/GenAI/CloudAI-Playground/vLLM_QAIC_Concurrency_Benchmark_on_AIC100/project.yaml new file mode 100644 index 0000000..18eb795 --- /dev/null +++ b/GenAI/CloudAI-Playground/vLLM_QAIC_Concurrency_Benchmark_on_AIC100/project.yaml @@ -0,0 +1,11 @@ +name: "vLLM QAIC Concurrency Benchmark on AIC100" +description: "Provides a reference benchmark for evaluating vLLM serving performance on Qualcomm AIC100 (QAIC), focusing on concurrency scaling, throughput, and multi-request LLM inference behavior." +category: + - "GenAI" + - "CloudAI-Playground" +platforms: + - "AIC100 Ultra" +tags: + - "vLLM" + - "LLM Benchmark" + - "QAIC" \ No newline at end of file diff --git a/README.md b/README.md index ebc5ec6..6f68208 100644 --- a/README.md +++ b/README.md @@ -191,6 +191,9 @@ This repository is a collection of demo applications that highlight the capabili
Traffic AI Agent – Dual Intersection (On‑Prem, Analysis) +
+ + vLLM QAIC Concurrency Benchmark on AIC100