diff --git a/GenAI/CloudAI-Playground/vLLM_QAIC_Concurrency_Benchmark_on_AIC100/README.md b/GenAI/CloudAI-Playground/vLLM_QAIC_Concurrency_Benchmark_on_AIC100/README.md
new file mode 100644
index 0000000..b4df2db
--- /dev/null
+++ b/GenAI/CloudAI-Playground/vLLM_QAIC_Concurrency_Benchmark_on_AIC100/README.md
@@ -0,0 +1,428 @@
+# [Startup_Demo](../../../)/[GenAI](../../)/[Cloud AI Playground](../)/[vLLM QAIC Concurrency Benchmark on AIC100](./)
+# vLLM QAIC Concurrency Benchmark on AIC100
+
+## 📘Table of Contents
+- [🧭Overview](#1overview)
+- [✨Features](#2features)
+- [🐳Requirements](#3requirements)
+- [⚙️Environment Setup](#4️environment-setup)
+- [📊Benchmark Methodology](#5benchmark-methodology)
+- [📦Benchmark Outputs](#6benchmark-outputs)
+- [🚀Demo](#7demo)
+- [✅Summary](#summary)
+- [🔧Customization](#customization)
+---
+## 1.🧭Overview
+
+This project benchmarks Large Language Model (LLM) serving performance on Qualcomm AIC100 using vLLM.
+
+The benchmark evaluates:
+- System-level throughput
+- Concurrency scaling behavior
+- Decode workload efficiency
+- System limits under heavy load
+
+The purpose is to provide a **comprehensive evaluation of QAIC hardware performance for LLM inference workloads**, including both optimal operating regions and failure conditions.
+
+```mermaid
+flowchart LR
+ Client["Benchmark Script (host machine)"]
+
+ subgraph Servers["vLLM Endpoints (4x)"]
+ S1[":8000\nDevice 0"]
+ S2[":8001\nDevice 1"]
+ S3[":8002\nDevice 2"]
+ S4[":8003\nDevice 3"]
+ end
+
+ D0["QAIC accel0"]
+ D1["QAIC accel1"]
+ D2["QAIC accel2"]
+ D3["QAIC accel3"]
+
+ Client -->|Concurrent Requests| S1
+ Client -->|Concurrent Requests| S2
+ Client -->|Concurrent Requests| S3
+ Client -->|Concurrent Requests| S4
+
+ S1 --> D0
+ S2 --> D1
+ S3 --> D2
+ S4 --> D3
+
+ D0 -->|Tokens| S1
+ D1 -->|Tokens| S2
+ D2 -->|Tokens| S3
+ D3 -->|Tokens| S4
+
+ S1 -->|Response| Client
+ S2 -->|Response| Client
+ S3 -->|Response| Client
+ S4 -->|Response| Client
+```
+Each vLLM endpoint is pinned to a dedicated QAIC device, enabling independent inference pipelines and predictable scaling behavior.
+
+
+---
+## 2.✨Features
+
+This benchmark provides the following analysis capabilities:
+
+### 🔹 System Throughput Characterization
+
+Measures total system throughput across multiple vLLM endpoints:
+- Requests per second (TPS)
+- Tokens per second (tokens/s)
+
+Allows identification of:
+- Peak performance
+- Throughput saturation point
+
+### 🔹 Concurrency Scaling Analysis
+Evaluates system behavior as concurrency increases:
+> concurrency = 1 → 64
+
+Captures:
+- Linear scaling region
+- Saturation region
+- Performance degradation
+
+### 🔹 Decode Workload Sensitivity
+Analyzes performance under different generation lengths:
+> max_tokens = [32, 128, 512]
+
+Reveals:
+- Overhead-bound vs compute-bound regimes
+- Efficiency trade-offs across workloads
+
+### 🔹 Multi-Endpoint Load Distribution
+Distributes load across 4 vLLM endpoints:
+> worker_id % 4
+
+Verifies:
+- Balanced hardware utilization
+- Absence of bottleneck devices
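+
+The `worker_id % 4` rule can be sketched as follows. This is an illustrative stand-alone snippet rather than part of the benchmark script, and `assign_endpoints` is a hypothetical helper name.
+
+```python
+from collections import Counter
+
+NUM_ENDPOINTS = 4  # matches the four vLLM endpoints in this setup
+
+
+def assign_endpoints(concurrency, num_endpoints=NUM_ENDPOINTS):
+    """Return how many workers land on each endpoint index under worker_id % N."""
+    counts = Counter(worker_id % num_endpoints for worker_id in range(concurrency))
+    return [counts.get(i, 0) for i in range(num_endpoints)]
+
+
+for level in (1, 4, 12, 64):
+    print(level, assign_endpoints(level))
+```
+
+Concurrency levels that are multiples of 4 spread workers evenly, whereas concurrency 1 sends all traffic to endpoint 0 and leaves the other three devices idle.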
+
+### 🔹 Failure and Stability Detection
+Identifies system limits under stress and failure boundaries:
+- Request failures
+- Timeout conditions
+- Throughput collapse
+
+---
+## 3.🐳Requirements
+
+This section describes both the hardware platform (QAIC) and the software environment used for benchmarking.
+
+
+### 3.1 AIC100 Hardware Platform (QAIC)
+
+The Qualcomm AI Cloud Inference (QAIC) AIC100 is a dedicated AI inference accelerator designed for:
+
+✅ Large-scale AI serving
+✅ Energy-efficient inference
+✅ On-prem deployment
+
+#### 🔸 Architectural Characteristics
+
+| Feature | Description |
+|---------|-------------|
+| Compute type | Dedicated inference accelerator |
+| Execution model | Optimized for continuous inference |
+| Parallelism | Multi-device scaling |
+| Memory model | KV cache intensive |
+| Workload target | Sustained decode workloads |
+
+#### 🔸 Hardware Specification
+
+| Component | Description |
+|----------|------------|
+| Accelerator | Qualcomm Cloud AI 100 Ultra (AIC100 Ultra) |
+| Device Interface | PCIe-based accelerator |
+| Device Nodes | `/dev/accel/accel*` |
+| Deployment Mode | Multi-device (4 cards in this benchmark) |
+| Target Workload | LLM inference (Generative AI workloads) |
+
+Reference: [Cloud AI 100 Ultra Overview](https://www.qualcomm.com/artificial-intelligence/data-center/cloud-ai-100-ultra#Overview)
+
+### 3.2 Software Environment
+
+- Ubuntu Linux host
+- Python3
+- Docker
+- Qualcomm Cloud AI SDK (Platform & Apps) v1.21.4
+- vLLM (OpenAI-compatible serving)
+
+> ✅ Before you begin following this guide, you need to pre‑install the [Qualcomm Cloud AI SDK](https://quic.github.io/cloud-ai-sdk-pages/1.21/Getting-Started/Installation/sdk-installation.html) on the AIC100.
+
+---
+## 4.⚙️Environment Setup
+
+### 4.1 Download the Docker Image
+
+To run multiple vLLM server endpoints, pull the Docker image from [Cloud AI Containers](https://github.com/quic/cloud-ai-containers/pkgs/container/cloud_ai_inference_ubuntu22).
+
+Use the following command to download the image:
+```bash
+docker pull ghcr.io/quic/cloud_ai_inference_ubuntu22:1.21.4.0
+```
+
+Verify that the image was downloaded successfully:
+```bash
+docker images
+```
+If successful, the repository `ghcr.io/quic/cloud_ai_inference_ubuntu22` with tag `1.21.4.0` appears in the list.
+
+
+💡*This sample uses the Docker image version cloud_ai_inference_ubuntu22:1.21.4.0.*
+
+### 4.2 Verify Available QAIC Devices
+
+Before creating containers, verify the available QAIC devices using:
+```bash
+sudo /opt/qti-aic/tools/qaic-util -t 1
+```
+
+💡*To reproduce this four-endpoint server setup, four QAIC devices are required.*
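+
+As a quick complementary check, the device nodes can also be counted from Python. This is a minimal sketch that assumes the nodes appear under `/dev/accel/accel*` as listed in the hardware table; `list_qaic_nodes` is a hypothetical helper name.
+
+```python
+import glob
+
+REQUIRED_DEVICES = 4  # one device per vLLM endpoint in this guide
+
+
+def list_qaic_nodes(pattern="/dev/accel/accel*"):
+    """Return the sorted device nodes matching the pattern."""
+    return sorted(glob.glob(pattern))
+
+
+nodes = list_qaic_nodes()
+print(f"Found {len(nodes)} device node(s): {nodes}")
+if len(nodes) < REQUIRED_DEVICES:
+    print(f"Fewer than {REQUIRED_DEVICES} nodes visible; the four-endpoint setup cannot be reproduced as-is.")
+```
+
+A visible node does not by itself confirm that the device is healthy, so `qaic-util` remains the authoritative check.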
+
+### 4.3 Create Containers for Server Endpoints
+
+In this sample, four instances of the same LLM are deployed, so four containers are required, one per server endpoint.
+
+```bash
+docker run -dit --name vllm-aic100-s1-dg0 --device=/dev/accel/accel0 -v /home/qitc/:/home/qitc/ -p 8000:8000 ghcr.io/quic/cloud_ai_inference_ubuntu22:1.21.4.0
+
+docker run -dit --name vllm-aic100-s2-dg1 --device=/dev/accel/accel1 -v /home/qitc/:/home/qitc/ -p 8001:8000 ghcr.io/quic/cloud_ai_inference_ubuntu22:1.21.4.0
+
+docker run -dit --name vllm-aic100-s3-dg2 --device=/dev/accel/accel2 -v /home/qitc/:/home/qitc/ -p 8002:8000 ghcr.io/quic/cloud_ai_inference_ubuntu22:1.21.4.0
+
+docker run -dit --name vllm-aic100-s4-dg3 --device=/dev/accel/accel3 -v /home/qitc/:/home/qitc/ -p 8003:8000 ghcr.io/quic/cloud_ai_inference_ubuntu22:1.21.4.0
+```
+Each container maps to one device and one port for serving an independent LLM endpoint.
+
+This configuration ensures that each vLLM instance is pinned to a dedicated QAIC device, enabling isolated and predictable performance measurement.
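+
+The naming and port pattern of the four commands above can be sketched as a small generator. This snippet only prints the commands and does not run Docker; `docker_run_command` is a hypothetical helper name.
+
+```python
+IMAGE = "ghcr.io/quic/cloud_ai_inference_ubuntu22:1.21.4.0"
+HOST_DIR = "/home/qitc/"  # host directory mounted into each container, as in the commands above
+
+
+def docker_run_command(index, base_port=8000):
+    """Build the `docker run` command for endpoint `index` (0-based)."""
+    name = f"vllm-aic100-s{index + 1}-dg{index}"
+    return (
+        f"docker run -dit --name {name} "
+        f"--device=/dev/accel/accel{index} "
+        f"-v {HOST_DIR}:{HOST_DIR} "
+        f"-p {base_port + index}:8000 {IMAGE}"
+    )
+
+
+for i in range(4):
+    print(docker_run_command(i))
+```
+
+Every container listens on port 8000 internally; only the published host port (8000 to 8003) differs.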
+
+### 4.4 Launching the Server
+
+The containers ship with all required dependencies pre-installed.
+Only the model download and server start-up remain.
+
+### 🔹 Hugging Face Setup
+
+Enter each container:
+```bash
+docker exec -it vllm-aic100-s1-dg0 /bin/bash
+
+docker exec -it vllm-aic100-s2-dg1 /bin/bash
+
+docker exec -it vllm-aic100-s3-dg2 /bin/bash
+
+docker exec -it vllm-aic100-s4-dg3 /bin/bash
+```
+
+Activate the pre-configured virtual environment (in each container):
+```bash
+source /opt/vllm-env/bin/activate
+```
+
+Log in to Hugging Face (required for model access):
+```bash
+huggingface-cli login
+```
+
+
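+💡*Note: `meta-llama/Llama-3.2-3B-Instruct` is a gated model. If you do not have access yet, sign in to Hugging Face and request it on the model page: [meta-llama/Llama-3.2-3B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct). The login command then prompts for your Hugging Face access token.*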
+
+### 🔹 Start the vLLM Server
+
+Run the following command in each container to launch the server:
+```bash
+python3 -m vllm.entrypoints.openai.api_server \
+--host 0.0.0.0 \
+--port 8000 \
+--device-group 0 \
+--model meta-llama/Llama-3.2-3B-Instruct \
+--max-model-len 1024 \
+--block-size 16 \
+--quantization mxfp6 \
+--kv-cache-dtype auto \
+--disable-sliding-window \
+--max-num-seqs 32
+```
+If the server starts successfully, you should see startup logs similar to those in `images/server-start-log.png`.
+
+
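+Before running the benchmark, it can help to confirm that all four endpoints respond. The sketch below queries the OpenAI-compatible `/v1/models` route on each published host port; `models_url` and `check_endpoint` are hypothetical helper names, and the ports assume the container mapping from section 4.3.
+
+```python
+import json
+import urllib.error
+import urllib.request
+
+PORTS = [8000, 8001, 8002, 8003]  # host ports published by the four containers
+
+
+def models_url(port, host="localhost"):
+    """URL of the OpenAI-compatible model listing for one endpoint."""
+    return f"http://{host}:{port}/v1/models"
+
+
+def check_endpoint(port, timeout_s=5):
+    """Return the model ids served on `port`, or None if the endpoint is unreachable."""
+    try:
+        with urllib.request.urlopen(models_url(port), timeout=timeout_s) as resp:
+            body = json.load(resp)
+    except (urllib.error.URLError, OSError, ValueError):
+        return None
+    return [m.get("id") for m in body.get("data", [])]
+
+
+if __name__ == "__main__":
+    for port in PORTS:
+        print(port, check_endpoint(port))
+```
+
+An endpoint that prints `None` is not reachable yet, for example because the model is still loading.
+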
+---
+## 5.📊Benchmark Methodology
+
+### 🔹 Test Configuration
+
+| Parameter | Value |
+|-----------|-------|
+| Model | meta-llama/Llama-3.2-3B-Instruct |
+| Prompt | "Explain what AI is in simple terms." |
+| max_tokens | 32 / 128 / 512 |
+| Concurrency | 1 → 64 |
+| Duration | 20 sec |
+| Warmup | 3 sec |
+| Repeats | 3 |
+
+### 🔹 Metrics
+
+✅ Throughput
+> TPS = number of successful requests completed per second
+- Calculated over the measurement window (excluding warmup)
+- Only successful responses are counted
+
+✅ Tokens per Second (tokens/s)
+> tokens/s = total completion tokens generated per second
+
+In this benchmark:
+- Most requests generate exactly `max_tokens`
+- Therefore, tokens/s ≈ TPS × max_tokens
+
+This results in near-linear scaling between TPS and tokens/s.
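+
+As a worked check of this approximation, the sketch below multiplies a few TPS values from the Key Results table in section 7 by `max_tokens = 128`. The helper name `estimated_tokens_per_s` is hypothetical.
+
+```python
+MAX_TOKENS = 128
+
+# TPS values (req/s) reported for max_tokens = 128 in the Key Results table
+tps_by_concurrency = {8: 1.20, 16: 2.40, 32: 3.20, 64: 6.40}
+
+
+def estimated_tokens_per_s(tps, max_tokens=MAX_TOKENS):
+    """Approximate tokens/s, assuming every request generates exactly max_tokens."""
+    return tps * max_tokens
+
+
+for concurrency, tps in tps_by_concurrency.items():
+    print(f"concurrency={concurrency}: ~{estimated_tokens_per_s(tps):.2f} tokens/s")
+```
+
+The estimates match the reported Tokens/s column (for example, 6.40 × 128 = 819.20), which is consistent with requests generating the full `max_tokens`.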
+
+✅ Latency
+- Average latency: mean end-to-end time per successful request
+- Maximum latency: worst-case request latency observed
+
+Includes:
+- Network overhead
+- Scheduling delay
+- Model inference time
+
+✅ Per-endpoint Metrics
+- TPS per endpoint (`ep0 ~ ep3`)
+- Tokens/sec per endpoint
+
+Used to evaluate:
+- Load distribution across QAIC devices
+- System balance and hardware utilization
+
+✅ Failure Rate
+- Counts failed requests per endpoint and globally
+- Includes:
+  - Timeout failures
+  - Unsuccessful responses
+
+A high failure rate typically indicates:
+- System saturation
+- Excessive queueing delay
+- Requests exceeding timeout limits under high concurrency
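+
+The script reports failure counts; a rate can be derived from the aggregated success and failure counts. The following is a minimal sketch with made-up input values, and `failure_rate` is a hypothetical helper that is not part of the benchmark script.
+
+```python
+def failure_rate(total_success, total_fail):
+    """Fraction of failed requests, or None when no request completed at all."""
+    total = total_success + total_fail
+    return total_fail / total if total else None
+
+
+# Hypothetical aggregated values for one concurrency level (not measured results)
+row = {"total_success_mean": 120.0, "total_fail_mean": 8.0}
+rate = failure_rate(row["total_success_mean"], row["total_fail_mean"])
+print(f"failure rate: {rate:.1%}")
+```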
+
+
+💡*Note: In this benchmark, tokens/sec scales approximately linearly with TPS under fixed max_tokens settings, indicating a stable and deterministic workload.*
+
+---
+## 6.📦Benchmark Outputs
+
+This benchmark generates structured outputs for performance analysis and visualization.
+
+These include raw metrics, throughput plots, latency plots, and multi-workload comparisons.
+
+
+### 🔹 Result Files
+
+| Category | Files | Purpose |
+|---------|------|--------|
+| Raw Metrics | `multi_results_tok*.csv` | Stores aggregated throughput, latency, and failure metrics |
+| Throughput Plots | `multi_tps_tok*.png` | Visualizes TPS scaling behavior |
+| Latency Plots | `multi_latency_tok*.png` | Visualizes latency trends across concurrency |
+| Per-endpoint Metrics | `per_endpoint_tps_tok*.png` | Shows load distribution across endpoints |
+| Comparative Analysis | `compare_*multi_tokens*.png` | Compares different workload behaviors |
+
+All outputs are stored in:
+```
+vLLM_QAIC_Concurrency_Benchmark_on_AIC100/results/
+```
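+
+The CSV files can be post-processed with the standard library. The sketch below assumes the column names written by the benchmark script (`concurrency`, `tok_per_s_mean`) and that it is run from the project directory; `load_results` and `peak_row` are hypothetical helper names.
+
+```python
+import csv
+import glob
+import os
+
+
+def load_results(csv_path):
+    """Read one aggregated results CSV into a list of dicts with float values where possible."""
+    rows = []
+    with open(csv_path, newline="", encoding="utf-8") as f:
+        for raw in csv.DictReader(f):
+            row = {}
+            for key, value in raw.items():
+                try:
+                    row[key] = float(value)
+                except (TypeError, ValueError):
+                    row[key] = None  # empty cells are written when a metric was unavailable
+            rows.append(row)
+    return rows
+
+
+def peak_row(rows, key="tok_per_s_mean"):
+    """Return the row with the highest value for `key`, ignoring missing values."""
+    valid = [r for r in rows if r.get(key) is not None]
+    return max(valid, key=lambda r: r[key]) if valid else None
+
+
+if __name__ == "__main__":
+    for path in sorted(glob.glob(os.path.join("results", "multi_results_tok*.csv"))):
+        best = peak_row(load_results(path))
+        if best:
+            print(path, "peak tokens/s:", best["tok_per_s_mean"], "at concurrency", int(best["concurrency"]))
+```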
+
+---
+## 7.🚀Demo
+
+This section demonstrates the performance of QAIC under a representative workload:
+
+```code
+max_tokens = 128 (balanced workload)
+```
+
+This setting provides the best trade-off between throughput, latency, and system stability.
+
+### 🔹 Run the Benchmark
+
+
+Execute the following command on the host machine (outside the container, e.g., AIC100 server).
+```bash
+python3 ./benchmark_multi_server_full_multitokens.py
+```
+💡*Note: This benchmark script runs outside the container and sends requests to the vLLM endpoints exposed by each container.*
+
+### 🔹 Key Results (max_tokens=128)
+
+| Concurrency | TPS (req/s) | Tokens/s | Avg Latency (s) |
+|------------|------------|----------|----------------|
+| 8 | 1.20 | 153.60 | 9.31 |
+| 16 | 2.40 | 307.20 | 10.86 |
+| 24 | 2.40 | 307.20 | 12.30 |
+| 32 | 3.20 | 409.60 | 13.71 |
+| 40 | 4.00 | 512.00 | 16.60 |
+| 48 | 4.80 | 614.40 | 17.93 |
+| 56 | 5.60 | 716.80 | 19.29 |
+| 64 | 6.40 | 819.20✅ | 20.61 |
+
+
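+### 🔹 Throughput Scaling
+
+Throughput rises from 1.20 req/s at concurrency 8 to 6.40 req/s at concurrency 64, with a plateau at 2.40 req/s between concurrency 16 and 24. No throughput collapse is observed within the tested range.
+
+Average latency grows from 9.31 s to 20.61 s over the same range, so the additional throughput at high concurrency comes at the cost of longer per-request response times. Because these latencies are comparable to the 20 s measurement window, each worker completes only a few requests per run and the reported TPS values are coarse.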
+
+
+
+### 🔹 Per-Endpoint Load Distribution
+
+
+
+Each endpoint shows nearly identical throughput across all tested concurrency levels, indicating:
+
+- ✅ Balanced load distribution across QAIC devices
+- ✅ No single-device bottleneck
+- ✅ Stable multi-device scaling behavior
+
+This confirms that the benchmarking setup effectively utilizes all available hardware resources.
+
+
+---
+
+## ✅Summary
+
+This benchmark demonstrates the performance characteristics of QAIC under different concurrency levels and decode workloads.
+
+Key findings include:
+
+- ✅ Throughput rises from 1.20 req/s at concurrency 8 to 6.40 req/s at concurrency 64 under `max_tokens = 128`
+- ✅ Peak performance of ~819 tokens/sec under `max_tokens = 128`
+- ✅ No obvious throughput collapse observed within tested range
+- ✅ Average latency grows from 9.31 s to 20.61 s over the same concurrency range
+- ✅ Stable and balanced utilization across all QAIC devices
+
+
+💡 *Note: This benchmark is performed using: `meta-llama/Llama-3.2-3B-Instruct`.
+Results may vary if different models (e.g., Qwen) are used due to architectural and decoding differences.*
+
+---
+
+## 🔧Customization
+
+This benchmark can be easily adapted to evaluate different models or configurations.
+
+To run tests with your own model, modify the following parameters in the script:
+
+- `MODEL_NAME`
+- `MAX_TOKENS_LIST`
+- `CONCURRENCY_LEVELS`
+
+For this example, the benchmark was performed using:
+```python
+MODEL_NAME = "meta-llama/Llama-3.2-3B-Instruct"
+```
+Users can replace the model with any supported LLM (e.g., other Hugging Face models) to evaluate performance under different workloads.
\ No newline at end of file
diff --git a/GenAI/CloudAI-Playground/vLLM_QAIC_Concurrency_Benchmark_on_AIC100/benchmark_multi_server_full_multitokens.py b/GenAI/CloudAI-Playground/vLLM_QAIC_Concurrency_Benchmark_on_AIC100/benchmark_multi_server_full_multitokens.py
new file mode 100644
index 0000000..7120801
--- /dev/null
+++ b/GenAI/CloudAI-Playground/vLLM_QAIC_Concurrency_Benchmark_on_AIC100/benchmark_multi_server_full_multitokens.py
@@ -0,0 +1,405 @@
+#===-- benchmark_multi_server_full_multitokens.py ------------------------===//
+# Part of the Startup-Demos Project, under the MIT License
+# See https://github.com/qualcomm/Startup-Demos/blob/main/LICENSE.txt
+# for license information.
+# Copyright (c) Qualcomm Technologies, Inc. and/or its subsidiaries.
+# SPDX-License-Identifier: MIT
+#===----------------------------------------------------------------------===//
+
+import asyncio
+import aiohttp
+import time
+import csv
+import os
+from datetime import datetime
+import matplotlib.pyplot as plt
+
+# =========================
+# Config
+# =========================
+ENDPOINTS = [
+    "http://localhost:8000/v1/chat/completions",
+    "http://localhost:8001/v1/chat/completions",
+    "http://localhost:8002/v1/chat/completions",
+    "http://localhost:8003/v1/chat/completions",
+]
+
+MODEL_NAME = "meta-llama/Llama-3.2-3B-Instruct"
+PROMPT = "Explain what AI is in simple terms."
+TEMPERATURE = 0.7
+
+# Sweep decode workload
+MAX_TOKENS_LIST = [32, 128, 512]
+
+# Sweep concurrency
+CONCURRENCY_LEVELS = [1, 4, 8, 12, 16, 24, 32, 40, 48, 56, 64]
+
+# Measurement windows
+DURATION_S = 20
+WARMUP_S = 3
+REPEATS_PER_LEVEL = 3
+REQUEST_TIMEOUT_S = 60
+
+# Metrics
+ENABLE_TOKEN_METRICS = True
+
+# Output
+RUN_TAG = datetime.now().strftime("%Y%m%d_%H%M%S")
+OUT_DIR = "results"
+
+
+# =========================
+# Payload builder
+# =========================
+def build_payload(max_tokens: int):
+    return {
+        "model": MODEL_NAME,
+        "messages": [{"role": "user", "content": PROMPT}],
+        "max_tokens": max_tokens,
+        "temperature": TEMPERATURE,
+    }
+
+
+# =========================
+# One request (E2E latency)
+# =========================
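+async def send_one(session, url, payload):
+    """Send one request; return (ok, end-to-end latency in seconds, completion tokens)."""
+    start = time.perf_counter()
+    # aiohttp expects a ClientTimeout object; passing a bare number is deprecated.
+    timeout = aiohttp.ClientTimeout(total=REQUEST_TIMEOUT_S)
+    try:
+        async with session.post(url, json=payload, timeout=timeout) as resp:
+            # Check the status first so that non-JSON error bodies are not parsed.
+            if resp.status < 200 or resp.status >= 300:
+                await resp.read()
+                return False, time.perf_counter() - start, None
+
+            data = await resp.json()
+            elapsed = time.perf_counter() - start
+
+            completion_tokens = None
+            if ENABLE_TOKEN_METRICS:
+                usage = data.get("usage", {})
+                if isinstance(usage, dict):
+                    completion_tokens = usage.get("completion_tokens", None)
+
+            return True, elapsed, completion_tokens
+
+    except Exception:
+        # Timeouts, connection errors, and malformed responses all count as failures.
+        return False, None, None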
+
+
+# =========================
+# Worker: fixed endpoint pinning (worker_id % N)
+# =========================
+async def worker(worker_id, session, end_t, warmup_end_t, per_worker_stats, payload):
+    # Pin this worker to one endpoint for the whole run.
+    n = len(ENDPOINTS)
+    ep_idx = worker_id % n
+    url = ENDPOINTS[ep_idx]
+
+    local_success = 0
+    local_fail = 0
+    local_lat = []
+    local_tokens = 0
+
+    while True:
+        now = time.perf_counter()
+        if now >= end_t:
+            break
+
+        ok, latency, comp_tokens = await send_one(session, url, payload)
+        done_t = time.perf_counter()
+
+        # Use completion time for warmup filtering.
+        # Note: a request that starts before end_t but finishes after it is
+        # still counted, while throughput is divided by DURATION_S. Keep
+        # DURATION_S well above the request latency to limit this bias.
+        if done_t < warmup_end_t:
+            continue
+
+        if ok:
+            local_success += 1
+            if latency is not None:
+                local_lat.append(latency)
+            if comp_tokens is not None:
+                local_tokens += comp_tokens
+        else:
+            local_fail += 1
+
+    per_worker_stats.append((ep_idx, local_success, local_fail, local_lat, local_tokens))
+
+
+# =========================
+# Run one level (one concurrency, one max_tokens) for fixed duration
+# =========================
+async def run_level(concurrency, max_tokens):
+    payload = build_payload(max_tokens)
+
+    timeout = aiohttp.ClientTimeout(total=REQUEST_TIMEOUT_S + 5)
+    async with aiohttp.ClientSession(timeout=timeout) as session:
+        t0 = time.perf_counter()
+        warmup_end = t0 + WARMUP_S
+        end_t = t0 + WARMUP_S + DURATION_S
+
+        per_worker_stats = []
+        tasks = [
+            worker(i, session, end_t, warmup_end, per_worker_stats, payload)
+            for i in range(concurrency)
+        ]
+        await asyncio.gather(*tasks)
+
+    # Aggregate per-endpoint
+    n = len(ENDPOINTS)
+    ep_success = [0] * n
+    ep_fail = [0] * n
+    ep_latencies = [[] for _ in range(n)]
+    ep_tokens = [0] * n
+
+    for ep_idx, s, f, lats, toks in per_worker_stats:
+        ep_success[ep_idx] += s
+        ep_fail[ep_idx] += f
+        ep_latencies[ep_idx].extend(lats)
+        ep_tokens[ep_idx] += toks
+
+    total_success = sum(ep_success)
+    total_fail = sum(ep_fail)
+    all_lat = [x for sub in ep_latencies for x in sub]
+
+    # Throughput over measured window (exclude warmup)
+    tps = total_success / DURATION_S if DURATION_S > 0 else 0.0
+    tok_per_s = (sum(ep_tokens) / DURATION_S) if (ENABLE_TOKEN_METRICS and DURATION_S > 0) else None
+
+    avg_lat = (sum(all_lat) / len(all_lat)) if all_lat else None
+    max_lat = max(all_lat) if all_lat else None
+
+    ep_tps = [s / DURATION_S for s in ep_success]
+    ep_tokps = [(t / DURATION_S) for t in ep_tokens] if ENABLE_TOKEN_METRICS else [None] * n
+    ep_avg_lat = [(sum(lats) / len(lats) if lats else None) for lats in ep_latencies]
+
+    return {
+        "concurrency": concurrency,
+        "max_tokens": max_tokens,
+
+        "total_success": total_success,
+        "total_fail": total_fail,
+
+        "tps_req_per_s": tps,
+        "tok_per_s": tok_per_s,
+
+        "avg_latency_s": avg_lat,
+        "max_latency_s": max_lat,
+
+        "ep_success": ep_success,
+        "ep_fail": ep_fail,
+        "ep_tps": ep_tps,
+        "ep_tokps": ep_tokps,
+        "ep_avg_lat": ep_avg_lat,
+    }
+
+
+# =========================
+# Repeat & aggregate
+# =========================
+def mean(vals):
+    vals = [v for v in vals if v is not None]
+    return sum(vals) / len(vals) if vals else None
+
+
+async def run_repeats(concurrency, repeats, max_tokens):
+    runs = []
+    for _ in range(repeats):
+        runs.append(await run_level(concurrency, max_tokens))
+    return runs
+
+
+def aggregate_runs(runs):
+    n = len(ENDPOINTS)
+    return {
+        "concurrency": runs[0]["concurrency"],
+        "max_tokens": runs[0]["max_tokens"],
+
+        "tps_req_per_s_mean": mean([r["tps_req_per_s"] for r in runs]),
+        "tok_per_s_mean": mean([r["tok_per_s"] for r in runs]),
+        "avg_latency_s_mean": mean([r["avg_latency_s"] for r in runs]),
+        "max_latency_s_mean": mean([r["max_latency_s"] for r in runs]),
+
+        **{f"ep{i}_tps_mean": mean([r["ep_tps"][i] for r in runs]) for i in range(n)},
+        **{f"ep{i}_tokps_mean": mean([r["ep_tokps"][i] for r in runs]) for i in range(n)},
+        **{f"ep{i}_avg_lat_mean": mean([r["ep_avg_lat"][i] for r in runs]) for i in range(n)},
+        **{f"ep{i}_fail_mean": mean([r["ep_fail"][i] for r in runs]) for i in range(n)},
+
+        "total_fail_mean": mean([r["total_fail"] for r in runs]),
+        "total_success_mean": mean([r["total_success"] for r in runs]),
+    }
+
+
+# =========================
+# IO & plotting
+# =========================
+def safe_format(x, digits=2):
+    return f"{x:.{digits}f}" if x is not None else "N/A"
+
+
+def save_csv(rows, path):
+    os.makedirs(os.path.dirname(path), exist_ok=True)
+    with open(path, "w", newline="", encoding="utf-8") as f:
+        w = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
+        w.writeheader()
+        w.writerows(rows)
+
+
+def plot_overall_tps(rows, out_png, title_suffix=""):
+    conc = [r["concurrency"] for r in rows]
+    tps = [r["tps_req_per_s_mean"] for r in rows]
+    plt.figure()
+    plt.plot(conc, tps, marker="o")
+    plt.xlabel("Concurrency (total across 4 servers)")
+    plt.ylabel("TPS (requests/sec)")
+    plt.title(f"Aggregate TPS vs Concurrency (4 servers) - {MODEL_NAME}{title_suffix}")
+    plt.grid(True, linestyle="--", alpha=0.6)
+    os.makedirs(os.path.dirname(out_png), exist_ok=True)
+    plt.tight_layout(rect=[0, 0, 1, 0.95])
+    plt.savefig(out_png, dpi=150, bbox_inches="tight", pad_inches=0.2)
+    plt.close()
+
+
+def plot_latency(rows, out_png, title_suffix=""):
+    conc = [r["concurrency"] for r in rows]
+    avg_lat = [r["avg_latency_s_mean"] for r in rows]
+    max_lat = [r["max_latency_s_mean"] for r in rows]
+    plt.figure()
+    plt.plot(conc, avg_lat, marker="o", label="Avg latency")
+    plt.plot(conc, max_lat, marker="x", label="Max latency")
+    plt.xlabel("Concurrency (total across 4 servers)")
+    plt.ylabel("Latency (sec)")
+    plt.title(f"Latency vs Concurrency (4 servers) - {MODEL_NAME}{title_suffix}")
+    plt.legend()
+    plt.grid(True, linestyle="--", alpha=0.6)
+    os.makedirs(os.path.dirname(out_png), exist_ok=True)
+    plt.tight_layout(rect=[0, 0, 1, 0.95])
+    plt.savefig(out_png, dpi=150, bbox_inches="tight", pad_inches=0.2)
+    plt.close()
+
+
+def plot_per_endpoint_tps(rows, out_png, title_suffix="", jitter_x=True):
+    """
+    jitter_x=True will shift x slightly per endpoint so overlapping lines are visible.
+    """
+    conc = [r["concurrency"] for r in rows]
+    plt.figure()
+
+    offsets = [-0.30, -0.10, 0.10, 0.30]
+    styles = ["-", "--", "-.", ":"]
+
+    for i in range(len(ENDPOINTS)):
+        tps_i = [r[f"ep{i}_tps_mean"] for r in rows]
+        x_i = [(c + offsets[i]) for c in conc] if jitter_x else conc
+        plt.plot(x_i, tps_i, marker="o", linestyle=styles[i], linewidth=2, label=f"endpoint {i}")
+
+    plt.xlabel("Concurrency (total across 4 servers)")
+    plt.ylabel("TPS per endpoint (req/s)")
+    plt.title(f"Per-endpoint TPS vs Concurrency - {MODEL_NAME}{title_suffix}")
+    plt.legend()
+    plt.grid(True, linestyle="--", alpha=0.6)
+    os.makedirs(os.path.dirname(out_png), exist_ok=True)
+    plt.tight_layout(rect=[0, 0, 1, 0.95])
+    plt.savefig(out_png, dpi=150, bbox_inches="tight", pad_inches=0.2)
+    plt.close()
+
+
+def plot_multi_tokens_curves(all_rows_by_tokens, out_png, y_key, y_label, title):
+    """
+    Draw multiple lines (one per max_tokens) on the same chart.
+    y_key: e.g. "tps_req_per_s_mean" or "tok_per_s_mean"
+    """
+    plt.figure()
+    for max_tokens, rows in all_rows_by_tokens.items():
+        conc = [r["concurrency"] for r in rows]
+        y = [r[y_key] for r in rows]
+        plt.plot(conc, y, marker="o", label=f"max_tokens={max_tokens}")
+
+    plt.xlabel("Concurrency (total across 4 servers)")
+    plt.ylabel(y_label)
+    plt.title(title)
+    plt.legend()
+    plt.grid(True, linestyle="--", alpha=0.6)
+    os.makedirs(os.path.dirname(out_png), exist_ok=True)
+    plt.tight_layout(rect=[0, 0, 1, 0.95])
+    plt.savefig(out_png, dpi=150, bbox_inches="tight", pad_inches=0.2)
+    plt.close()
+
+
+# =========================
+# Main
+# =========================
+async def main():
+    print("Endpoints:")
+    for e in ENDPOINTS:
+        print(" ", e)
+    print(f"\nModel={MODEL_NAME}, duration={DURATION_S}s, warmup={WARMUP_S}s, repeats={REPEATS_PER_LEVEL}")
+    print(f"MAX_TOKENS_LIST={MAX_TOKENS_LIST}\n")
+
+    os.makedirs(OUT_DIR, exist_ok=True)
+
+    # Collect results for final multi-curve plots
+    all_rows_by_tokens = {}
+
+    for max_tokens in MAX_TOKENS_LIST:
+        print("\n==============================")
+        print(f" MAX_TOKENS = {max_tokens}")
+        print("==============================\n")
+
+        agg_rows = []
+
+        for c in CONCURRENCY_LEVELS:
+            runs = await run_repeats(c, REPEATS_PER_LEVEL, max_tokens)
+            agg = aggregate_runs(runs)
+            agg_rows.append(agg)
+
+            print(f"=== Concurrency {c} ===")
+            print(f"Aggregate TPS(mean): {safe_format(agg['tps_req_per_s_mean'])} req/s")
+            print(f"Tokens/s(mean): {safe_format(agg['tok_per_s_mean'])}")
+            print(f"Avg latency(mean): {safe_format(agg['avg_latency_s_mean'])} s")
+            print(f"Max latency(mean): {safe_format(agg['max_latency_s_mean'])} s")
+            print(f"Total fail(mean): {safe_format(agg['total_fail_mean'])}")
+            print("")
+
+        all_rows_by_tokens[max_tokens] = agg_rows
+
+        # Per max_tokens outputs
+        csv_path = f"{OUT_DIR}/multi_results_tok{max_tokens}_{RUN_TAG}.csv"
+        tps_png = f"{OUT_DIR}/multi_tps_tok{max_tokens}_{RUN_TAG}.png"
+        lat_png = f"{OUT_DIR}/multi_latency_tok{max_tokens}_{RUN_TAG}.png"
+        per_png = f"{OUT_DIR}/per_endpoint_tps_tok{max_tokens}_{RUN_TAG}.png"
+
+        save_csv(agg_rows, csv_path)
+        plot_overall_tps(agg_rows, tps_png, title_suffix=f" (max_tokens={max_tokens})")
+        plot_latency(agg_rows, lat_png, title_suffix=f" (max_tokens={max_tokens})")
+        plot_per_endpoint_tps(agg_rows, per_png, title_suffix=f" (max_tokens={max_tokens})", jitter_x=True)
+
+        print("Saved:")
+        print(" CSV :", csv_path)
+        print(" TPS :", tps_png)
+        print(" LAT :", lat_png)
+        print(" PER :", per_png)
+
+    # Final multi-curve summary plots (compare max_tokens)
+    multi_tps_png = f"{OUT_DIR}/compare_tps_multi_tokens_{RUN_TAG}.png"
+    multi_tokps_png = f"{OUT_DIR}/compare_tokps_multi_tokens_{RUN_TAG}.png"
+
+    plot_multi_tokens_curves(
+        all_rows_by_tokens,
+        multi_tps_png,
+        y_key="tps_req_per_s_mean",
+        y_label="TPS (requests/sec)",
+        title=f"Aggregate TPS vs Concurrency (compare max_tokens) - {MODEL_NAME}"
+    )
+
+    plot_multi_tokens_curves(
+        all_rows_by_tokens,
+        multi_tokps_png,
+        y_key="tok_per_s_mean",
+        y_label="Tokens/sec",
+        title=f"Tokens/sec vs Concurrency (compare max_tokens) - {MODEL_NAME}"
+    )
+
+    print("\nSaved multi-curve comparison plots:")
+    print(" TPS compare :", multi_tps_png)
+    print(" Tok/s compare:", multi_tokps_png)
+
+
+if __name__ == "__main__":
+    asyncio.run(main())
\ No newline at end of file
diff --git a/GenAI/CloudAI-Playground/vLLM_QAIC_Concurrency_Benchmark_on_AIC100/images/docker-images-show.png b/GenAI/CloudAI-Playground/vLLM_QAIC_Concurrency_Benchmark_on_AIC100/images/docker-images-show.png
new file mode 100644
index 0000000..d9c285c
Binary files /dev/null and b/GenAI/CloudAI-Playground/vLLM_QAIC_Concurrency_Benchmark_on_AIC100/images/docker-images-show.png differ
diff --git a/GenAI/CloudAI-Playground/vLLM_QAIC_Concurrency_Benchmark_on_AIC100/images/huggingface_cli_login.png b/GenAI/CloudAI-Playground/vLLM_QAIC_Concurrency_Benchmark_on_AIC100/images/huggingface_cli_login.png
new file mode 100644
index 0000000..5ec0507
Binary files /dev/null and b/GenAI/CloudAI-Playground/vLLM_QAIC_Concurrency_Benchmark_on_AIC100/images/huggingface_cli_login.png differ
diff --git a/GenAI/CloudAI-Playground/vLLM_QAIC_Concurrency_Benchmark_on_AIC100/images/multi_tps_tok128.png b/GenAI/CloudAI-Playground/vLLM_QAIC_Concurrency_Benchmark_on_AIC100/images/multi_tps_tok128.png
new file mode 100644
index 0000000..587a424
Binary files /dev/null and b/GenAI/CloudAI-Playground/vLLM_QAIC_Concurrency_Benchmark_on_AIC100/images/multi_tps_tok128.png differ
diff --git a/GenAI/CloudAI-Playground/vLLM_QAIC_Concurrency_Benchmark_on_AIC100/images/per_endpoint_tps_tok128.png b/GenAI/CloudAI-Playground/vLLM_QAIC_Concurrency_Benchmark_on_AIC100/images/per_endpoint_tps_tok128.png
new file mode 100644
index 0000000..c2f6c36
Binary files /dev/null and b/GenAI/CloudAI-Playground/vLLM_QAIC_Concurrency_Benchmark_on_AIC100/images/per_endpoint_tps_tok128.png differ
diff --git a/GenAI/CloudAI-Playground/vLLM_QAIC_Concurrency_Benchmark_on_AIC100/images/qaic-devices-status.png b/GenAI/CloudAI-Playground/vLLM_QAIC_Concurrency_Benchmark_on_AIC100/images/qaic-devices-status.png
new file mode 100644
index 0000000..1e4bc7d
Binary files /dev/null and b/GenAI/CloudAI-Playground/vLLM_QAIC_Concurrency_Benchmark_on_AIC100/images/qaic-devices-status.png differ
diff --git a/GenAI/CloudAI-Playground/vLLM_QAIC_Concurrency_Benchmark_on_AIC100/images/server-start-log.png b/GenAI/CloudAI-Playground/vLLM_QAIC_Concurrency_Benchmark_on_AIC100/images/server-start-log.png
new file mode 100644
index 0000000..a02beb5
Binary files /dev/null and b/GenAI/CloudAI-Playground/vLLM_QAIC_Concurrency_Benchmark_on_AIC100/images/server-start-log.png differ
diff --git a/GenAI/CloudAI-Playground/vLLM_QAIC_Concurrency_Benchmark_on_AIC100/project.yaml b/GenAI/CloudAI-Playground/vLLM_QAIC_Concurrency_Benchmark_on_AIC100/project.yaml
new file mode 100644
index 0000000..18eb795
--- /dev/null
+++ b/GenAI/CloudAI-Playground/vLLM_QAIC_Concurrency_Benchmark_on_AIC100/project.yaml
@@ -0,0 +1,11 @@
+name: "vLLM QAIC Concurrency Benchmark on AIC100"
+description: "Provides a reference benchmark for evaluating vLLM serving performance on Qualcomm AIC100 (QAIC), focusing on concurrency scaling, throughput, and multi-request LLM inference behavior."
+category:
+ - "GenAI"
+ - "CloudAI-Playground"
+platforms:
+ - "AIC100 Ultra"
+tags:
+ - "vLLM"
+ - "LLM Benchmark"
+ - "QAIC"
\ No newline at end of file
diff --git a/README.md b/README.md
index ebc5ec6..6f68208 100644
--- a/README.md
+++ b/README.md
@@ -191,6 +191,9 @@ This repository is a collection of demo applications that highlight the capabili
+
+
+