diff --git a/GenAI/CloudAI-Playground/vLLM_QAIC_Concurrency_Benchmark_on_AIC100/README.md b/GenAI/CloudAI-Playground/vLLM_QAIC_Concurrency_Benchmark_on_AIC100/README.md
new file mode 100644
index 0000000..b4df2db
--- /dev/null
+++ b/GenAI/CloudAI-Playground/vLLM_QAIC_Concurrency_Benchmark_on_AIC100/README.md
@@ -0,0 +1,428 @@
+# [Startup_Demo](../../../)/[GenAI](../../)/[Cloud AI Playground](../)/[vLLM QAIC Concurrency Benchmark on AIC100](./)
+# vLLM QAIC Concurrency Benchmark on AIC100
+
+## 📘Table of Contents
+- [🧭Overview](#1overview)
+- [✨Features](#2features)
+- [🐳Requirements](#3requirements)
+- [⚙️Environment Setup](#4️environment-setup)
+- [📊Benchmark Methodology](#5benchmark-methodology)
+- [📦Benchmark Outputs](#6benchmark-outputs)
+- [🚀Demo](#7demo)
+- [✅Summary](#summary)
+- [🔧Customization](#customization)
+---
+## 1.🧭Overview
+
+This project benchmarks Large Language Model (LLM) serving performance on Qualcomm AIC100 using vLLM.
+
+The benchmark evaluates:
+- System-level throughput
+- Concurrency scaling behavior
+- Decode workload efficiency
+- System limits under heavy load
+
+The purpose is to provide a **comprehensive evaluation of QAIC hardware performance for LLM inference workloads**, including both optimal operating regions and failure conditions.
+
+```mermaid
+flowchart LR
+    Client["Benchmark Script (host machine)"]
+
+    subgraph Servers["vLLM Endpoints (4x)"]
+        S1[":8000\nDevice 0"]
+        S2[":8001\nDevice 1"]
+        S3[":8002\nDevice 2"]
+        S4[":8003\nDevice 3"]
+    end
+
+    D0["QAIC accel0"]
+    D1["QAIC accel1"]
+    D2["QAIC accel2"]
+    D3["QAIC accel3"]
+
+    Client -->|Concurrent Requests| S1
+    Client -->|Concurrent Requests| S2
+    Client -->|Concurrent Requests| S3
+    Client -->|Concurrent Requests| S4
+
+    S1 --> D0
+    S2 --> D1
+    S3 --> D2
+    S4 --> D3
+
+    D0 -->|Tokens| S1
+    D1 -->|Tokens| S2
+    D2 -->|Tokens| S3
+    D3 -->|Tokens| S4
+
+    S1 -->|Response| Client
+    S2 -->|Response| Client
+    S3 -->|Response| Client
+    S4 -->|Response| Client
+```
+Each vLLM endpoint is pinned to a dedicated QAIC device, enabling independent inference pipelines and predictable scaling behavior.
+
+
+---
+## 2.✨Features
+
+This benchmark provides the following analysis capabilities:
+
+### 🔹 System Throughput Characterization
+
+Measures total system throughput across multiple vLLM endpoints:
+- Requests per second (req/s, reported as TPS throughout this README)
+- Tokens per second (tokens/s)
+
+Allows identification of:
+- Peak performance
+- Throughput saturation point
+
+### 🔹 Concurrency Scaling Analysis
+Evaluates system behavior as concurrency increases:
+> concurrency = 1 → 64
+
+Captures:
+- Linear scaling region
+- Saturation region
+- Performance degradation
+
+### 🔹 Decode Workload Sensitivity
+Analyzes performance under different generation lengths:
+> max_tokens = [32, 128, 512]
+
+Reveals:
+- Overhead-bound vs compute-bound regimes
+- Efficiency trade-offs across workloads
+
+### 🔹 Multi-Endpoint Load Distribution
+Distributes load across 4 vLLM endpoints:
+> worker_id % 4
+
+Verifies:
+- Balanced hardware utilization
+- Absence of bottleneck devices
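+
+The pinning rule above can be sketched in a few lines. This is an illustrative example rather than part of the benchmark script; `workers_per_endpoint` is a hypothetical helper that only counts how many workers each endpoint would receive:
+
+```python
+from collections import Counter
+
+NUM_ENDPOINTS = 4  # matches the four vLLM endpoints used in this setup
+
+
+def workers_per_endpoint(concurrency, num_endpoints=NUM_ENDPOINTS):
+    """Count how many workers land on each endpoint under worker_id % N pinning."""
+    counts = Counter(worker_id % num_endpoints for worker_id in range(concurrency))
+    return [counts[i] for i in range(num_endpoints)]
+
+
+for concurrency in (1, 8, 12, 64):
+    print(concurrency, workers_per_endpoint(concurrency))
+```
+
+The split is exactly even only when concurrency is a multiple of 4. At concurrency 1, only endpoint 0 receives traffic.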
+
+### 🔹 Failure and Stability Detection
+Identifies system limits under stress and failure boundaries:
+- Request failures
+- Timeout conditions
+- Throughput collapse
+
+---
+## 3.🐳Requirements
+
+This section describes both the hardware platform (QAIC) and the software environment used for benchmarking.
+
+
+### 3.1 AIC100 Hardware Platform (QAIC)
+
+The Qualcomm Cloud AI 100 (AIC100), referred to in this README as QAIC, is a dedicated AI inference accelerator designed for:
+
+- ✅ Large-scale AI serving
+- ✅ Energy-efficient inference
+- ✅ On-prem deployment
+
+#### 🔸 Architectural Characteristics
+
+| Feature | Description |
+|---------|-------------|
+| Compute type | Dedicated inference accelerator |
+| Execution model | Optimized for continuous inference |
+| Parallelism | Multi-device scaling | 
+| Memory model | KV cache intensive |
+| Workload target | Sustained decode workloads |
+
+#### 🔸 Hardware Specification
+
+| Component | Description |
+|----------|------------|
+| Accelerator | Qualcomm Cloud AI 100 Ultra (AIC100 Ultra) |
+| Device Interface | PCIe-based accelerator |
+| Device Nodes | `/dev/accel/accel*` |
+| Deployment Mode | Multi-device (4 cards in this benchmark) |
+| Target Workload | LLM inference (Generative AI workloads) |
+
+Reference: [Cloud AI 100 Ultra Overview](https://www.qualcomm.com/artificial-intelligence/data-center/cloud-ai-100-ultra#Overview)
+
+### 3.2 Software Environment
+
+- Ubuntu Linux host
+- Python3
+- Docker
+- Qualcomm Cloud AI SDK (Platform & Apps) v1.21.4
+- vLLM (OpenAI-compatible serving)
+
+> ✅ Before following this guide, install the [Qualcomm Cloud AI SDK](https://quic.github.io/cloud-ai-sdk-pages/1.21/Getting-Started/Installation/sdk-installation.html) on the AIC100 host.
+
+---
+## 4.⚙️Environment Setup
+
+### 4.1 Download the Docker Image
+
+To serve multiple vLLM endpoints, pull the inference Docker image from [Cloud AI Containers](https://github.com/quic/cloud-ai-containers/pkgs/container/cloud_ai_inference_ubuntu22).
+
+Use the following command to download the image:
+```bash
+docker pull ghcr.io/quic/cloud_ai_inference_ubuntu22:1.21.4.0
+```
+
+Verify that the image was downloaded successfully:
+```bash
+docker images
+```
+If successful, the repository should appear in the list as shown below:
+![Docker images list](./images/docker-images-show.png)
+
+💡*This sample uses the Docker image version cloud_ai_inference_ubuntu22:1.21.4.0.*
+
+### 4.2 Verify Available QAIC Devices
+
+Before creating containers, verify the available QAIC devices using:
+```bash
+sudo /opt/qti-aic/tools/qaic-util -t 1
+```
+![QAIC device status](./images/qaic-devices-status.png)
+
+💡*To reproduce this multi-endpoint setup, four QAIC devices are required.*
+
+### 4.3 Create Containers for Server Endpoints
+
+This sample runs four instances of the same model, so four containers are required, one per server endpoint.
+
+```bash
+docker run -dit --name vllm-aic100-s1-dg0 --device=/dev/accel/accel0 -v /home/qitc/:/home/qitc/ -p 8000:8000 ghcr.io/quic/cloud_ai_inference_ubuntu22:1.21.4.0
+
+docker run -dit --name vllm-aic100-s2-dg1 --device=/dev/accel/accel1 -v /home/qitc/:/home/qitc/ -p 8001:8000 ghcr.io/quic/cloud_ai_inference_ubuntu22:1.21.4.0
+
+docker run -dit --name vllm-aic100-s3-dg2 --device=/dev/accel/accel2 -v /home/qitc/:/home/qitc/ -p 8002:8000 ghcr.io/quic/cloud_ai_inference_ubuntu22:1.21.4.0
+
+docker run -dit --name vllm-aic100-s4-dg3 --device=/dev/accel/accel3 -v /home/qitc/:/home/qitc/ -p 8003:8000 ghcr.io/quic/cloud_ai_inference_ubuntu22:1.21.4.0
+```
+Each container maps to one device and one port for serving an independent LLM endpoint.
+
+This configuration ensures that each vLLM instance is pinned to a dedicated QAIC device, enabling isolated and predictable performance measurement.
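+
+The four commands above follow one pattern: device `accelN` maps to host port `8000 + N` and container name `vllm-aic100-s{N+1}-dg{N}`. A minimal sketch that generates them, assuming the same image tag and shared directory as above (`docker_run_command` is a hypothetical helper, not part of this project):
+
+```python
+IMAGE = "ghcr.io/quic/cloud_ai_inference_ubuntu22:1.21.4.0"
+SHARED_DIR = "/home/qitc/"  # host directory mounted into every container
+
+
+def docker_run_command(index):
+    """Build the docker run command for QAIC device `index` (0-based)."""
+    return (
+        f"docker run -dit --name vllm-aic100-s{index + 1}-dg{index} "
+        f"--device=/dev/accel/accel{index} "
+        f"-v {SHARED_DIR}:{SHARED_DIR} "
+        f"-p {8000 + index}:8000 {IMAGE}"
+    )
+
+
+for i in range(4):
+    print(docker_run_command(i))
+```
+
+Every container listens on port 8000 internally; only the published host port differs.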
+
+### 4.4 Launching the Server
+
+The containers ship with all required dependencies pre-installed.
+Only the model download and server startup remain.
+
+### 🔹 Hugging Face Setup
+
+Enter each container:
+```bash
+docker exec -it vllm-aic100-s1-dg0 /bin/bash
+
+docker exec -it vllm-aic100-s2-dg1 /bin/bash
+
+docker exec -it vllm-aic100-s3-dg2 /bin/bash
+
+docker exec -it vllm-aic100-s4-dg3 /bin/bash
+```
+
+Activate the pre-configured virtual environment (in each container):
+```bash
+source /opt/vllm-env/bin/activate
+```
+
+Log in to Hugging Face (required for model access):
+```bash
+huggingface-cli login
+```
+![Hugging Face CLI login](./images/huggingface_cli_login.png)
+
+💡*Note: If you don't have access to the model yet, sign in to Hugging Face, request access on the model page [meta-llama/Llama-3.2-3B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct), and create an access token to use at the login prompt.*
+
+### 🔹 Start the vLLM Server
+
+Run the following command in each container to launch the server:
+```bash
+python3 -m vllm.entrypoints.openai.api_server \
+--host 0.0.0.0 \
+--port 8000 \
+--device-group 0 \
+--model meta-llama/Llama-3.2-3B-Instruct \
+--max-model-len 1024 \
+--block-size 16 \
+--quantization mxfp6 \
+--kv-cache-dtype auto \
+--disable-sliding-window \
+--max-num-seqs 32
+```
+If the server starts successfully, you should see logs similar to:
+![vLLM server startup log](./images/server-start-log.png)
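+
+Before benchmarking, it can help to confirm that every endpoint answers a small request. The following is a minimal sketch using only the Python standard library, run from the host. It assumes the four host ports published above and the OpenAI-style `choices` field in the response; `build_request` and `check_endpoint` are hypothetical helpers, not part of the benchmark script.
+
+```python
+import json
+import urllib.request
+
+PORTS = [8000, 8001, 8002, 8003]  # host ports published by the four containers
+MODEL_NAME = "meta-llama/Llama-3.2-3B-Instruct"
+
+
+def build_request(port):
+    """Build a small chat-completion request for the endpoint on a host port."""
+    payload = {
+        "model": MODEL_NAME,
+        "messages": [{"role": "user", "content": "Say hello."}],
+        "max_tokens": 8,
+    }
+    return urllib.request.Request(
+        f"http://localhost:{port}/v1/chat/completions",
+        data=json.dumps(payload).encode("utf-8"),
+        headers={"Content-Type": "application/json"},
+        method="POST",
+    )
+
+
+def check_endpoint(port, timeout_s=60.0):
+    """Return True if the endpoint answers HTTP 200 with a non-empty choices list."""
+    try:
+        with urllib.request.urlopen(build_request(port), timeout=timeout_s) as resp:
+            body = json.load(resp)
+            return resp.status == 200 and bool(body.get("choices"))
+    except (OSError, ValueError):
+        # Connection errors, HTTP errors, timeouts, and invalid JSON all count as failures
+        return False
+
+
+if __name__ == "__main__":
+    for port in PORTS:
+        print(port, "OK" if check_endpoint(port) else "FAILED")
+```
+
+An endpoint reported as `FAILED` usually means the server in that container is not running yet or the port mapping differs from the one assumed here.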
+
+---
+## 5.📊Benchmark Methodology
+
+### 🔹 Test Configuration
+
+| Parameter | Value | 
+|-----------|-------|
+| Model | meta-llama/Llama-3.2-3B-Instruct |
+| Prompt | "Explain what AI is in simple terms." | 
+| max_tokens | 32 / 128 / 512 |
+| Concurrency | 1 → 64 |
+| Duration | 20 sec | 
+| Warmup | 3 sec |
+| Repeats | 3 |
+
+### 🔹 Metrics
+
+✅ Throughput
+> TPS = number of successful requests completed per second
+- Calculated over the measurement window (excluding warmup)
+- Only successful responses are counted
+
+✅ Tokens per Second (tokens/s)
+> tokens/s = total completion tokens generated per second
+
+In this benchmark:
+- Most requests generate exactly `max_tokens` tokens
+- Therefore, tokens/s ≈ TPS × max_tokens
+
+As a result, tokens/s is nearly proportional to TPS for a fixed `max_tokens`.
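+
+The relation can be checked against the reported numbers. This is a small worked example, using three rows copied from the Key Results table in the Demo section (`max_tokens = 128`); `expected_tokens_per_s` is a hypothetical helper:
+
+```python
+MAX_TOKENS = 128
+
+# (concurrency, TPS in req/s, reported tokens/s) copied from the Key Results table
+reported = [(8, 1.20, 153.60), (32, 3.20, 409.60), (64, 6.40, 819.20)]
+
+
+def expected_tokens_per_s(tps, max_tokens=MAX_TOKENS):
+    """Estimate tokens/s assuming every request generates exactly max_tokens tokens."""
+    return tps * max_tokens
+
+
+for concurrency, tps, tokens_per_s in reported:
+    estimate = expected_tokens_per_s(tps)
+    print(f"concurrency={concurrency}: {estimate:.2f} estimated vs {tokens_per_s:.2f} reported")
+```
+
+The estimate matches the reported value for these rows, which is consistent with every counted request running to the `max_tokens` limit.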
+
+✅ Latency
+- Average latency: mean end-to-end time per successful request  
+- Maximum latency: worst-case request latency observed  
+
+Includes:
+- Network overhead  
+- Scheduling delay  
+- Model inference time  
+
+✅ Per-endpoint Metrics
+- TPS per endpoint (`ep0 ~ ep3`)  
+- Tokens/sec per endpoint  
+
+Used to evaluate:
+- Load distribution across QAIC devices  
+- System balance and hardware utilization  
+
+✅ Failure Rate
+- Counts failed requests per endpoint and globally  
+- Includes:
+  - Timeout failures  
+  - Unsuccessful responses  
+
+A high failure rate typically indicates:
+- System saturation  
+- Excessive queueing delay  
+- Requests exceeding timeout limits under high concurrency
+
+
+💡*Note: Because tokens/sec tracks TPS under a fixed `max_tokens` setting, the two metrics carry the same information within a single `max_tokens` sweep. They differ only when comparing across `max_tokens` values.*
+
+---
+## 6.📦Benchmark Outputs
+
+This benchmark generates structured outputs for performance analysis and visualization.
+
+These include raw metrics, throughput plots, latency plots, and multi-workload comparisons.
+
+
+### 🔹 Result Files
+
+| Category | Files | Purpose |
+|---------|------|--------|
+| Raw Metrics | `multi_results_tok*.csv` | Stores aggregated throughput, latency, and failure metrics |
+| Throughput Plots | `multi_tps_tok*.png` | Visualizes TPS scaling behavior |
+| Latency Plots | `multi_latency_tok*.png` | Visualizes latency trends across concurrency |
+| Per-endpoint Metrics | `per_endpoint_tps_tok*.png` | Shows load distribution across endpoints |
+| Comparative Analysis | `compare_*multi_tokens*.png` | Compares different workload behaviors |
+
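+All outputs are written to a `results/` directory relative to the directory the script is run from. When run from the project folder, this is: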
+```
+vLLM_QAIC_Concurrency_Benchmark_on_AIC100/results/
+```
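+
+The CSV files can be post-processed with the standard library. A minimal sketch that finds the concurrency level with the highest token throughput, assuming the `concurrency` and `tok_per_s_mean` columns written by the benchmark script (`peak_tokens_per_s` is a hypothetical helper, and the sample data below stands in for a real results file):
+
+```python
+import csv
+import io
+
+
+def peak_tokens_per_s(csv_file):
+    """Return (concurrency, tokens/s) for the row with the highest tok_per_s_mean."""
+    best = None
+    for row in csv.DictReader(csv_file):
+        if not row["tok_per_s_mean"]:
+            continue  # skip rows where token metrics were not recorded
+        point = (int(row["concurrency"]), float(row["tok_per_s_mean"]))
+        if best is None or point[1] > best[1]:
+            best = point
+    return best
+
+
+# Stand-in for a results file; with real data, pass open(<path to CSV>, newline="")
+sample = io.StringIO(
+    "concurrency,tok_per_s_mean\n"
+    "8,153.6\n"
+    "32,409.6\n"
+    "64,819.2\n"
+)
+print(peak_tokens_per_s(sample))
+```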
+
+---
+## 7.🚀Demo
+
+This section demonstrates the performance of QAIC under a representative workload:
+
+```
+max_tokens = 128 (balanced workload)
+```
+
+This setting sits between the short (32) and long (512) decode workloads in the sweep and is used here as the reference configuration.
+
+### 🔹 Run the Benchmark
+
+
+Run the following command on the host machine (outside the containers, e.g., on the AIC100 server). The script requires the `aiohttp` and `matplotlib` Python packages.
+```bash
+python3 ./benchmark_multi_server_full_multitokens.py
+```
+💡*Note: This benchmark script runs outside the container and sends requests to the vLLM endpoints exposed by each container.*
+
+### 🔹 Key Results (max_tokens=128)
+
+| Concurrency | TPS (req/s) | Tokens/s | Avg Latency (s) |
+|------------|------------|----------|----------------|
+| 8  | 1.20 | 153.60 | 9.31 |
+| 16 | 2.40 | 307.20 | 10.86 |
+| 24 | 2.40 | 307.20 | 12.30 |
+| 32 | 3.20 | 409.60 | 13.71 |
+| 40 | 4.00 | 512.00 | 16.60 |
+| 48 | 4.80 | 614.40 | 17.93 |
+| 56 | 5.60 | 716.80 | 19.29 |
+| 64 | 6.40 | 819.20✅ | 20.61 |
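+
+As a cross-check on the table above, Little's law relates its columns: in steady state, throughput is approximately the number of requests in flight divided by the average latency. This is a minimal sketch, assuming each worker keeps exactly one request in flight at a time; `steady_state_tps` is a hypothetical helper:
+
+```python
+# (concurrency, reported TPS in req/s, average latency in s) copied from the table above
+rows = [
+    (8, 1.20, 9.31), (16, 2.40, 10.86), (24, 2.40, 12.30), (32, 3.20, 13.71),
+    (40, 4.00, 16.60), (48, 4.80, 17.93), (56, 5.60, 19.29), (64, 6.40, 20.61),
+]
+
+
+def steady_state_tps(concurrency, avg_latency_s):
+    """Little's law: throughput = requests in flight / average latency."""
+    return concurrency / avg_latency_s
+
+
+for concurrency, reported_tps, avg_latency_s in rows:
+    estimate = steady_state_tps(concurrency, avg_latency_s)
+    print(f"concurrency={concurrency:2d}  reported={reported_tps:.2f}  estimate={estimate:.2f} req/s")
+```
+
+Where the estimate and the reported TPS differ, the short 20 s window is a likely contributor: at 9 to 21 s per request, each worker finishes only a few requests per run, so a longer `DURATION_S` should give a more precise figure.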
+
+
+### 🔹 Throughput Scaling
+
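+Reported throughput rises with concurrency across the tested range, apart from a plateau between concurrency 16 and 24, and shows no collapse up to concurrency = 64.
+
+Average latency grows from 9.31 s at concurrency 8 to 20.61 s at concurrency 64, so the added throughput comes at the cost of longer per-request latency. Because each run measures a 20 s window and a request takes 9 to 21 s, every worker completes only a few requests per run, which makes the reported TPS values coarse.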
+
+![TPS vs Concurrency](images/multi_tps_tok128.png)
+
+### 🔹 Per-Endpoint Load Distribution
+
+![Per-endpoint TPS](images/per_endpoint_tps_tok128.png)
+
+Each endpoint shows nearly identical throughput across all tested concurrency levels, indicating:
+
+- ✅ Balanced load distribution across QAIC devices
+- ✅ No single-device bottleneck
+- ✅ Stable multi-device scaling behavior
+
+This confirms that the benchmarking setup effectively utilizes all available hardware resources.
+
+
+---
+
+## ✅Summary
+
+This benchmark demonstrates the performance characteristics of QAIC under different concurrency levels and decode workloads.
+
+Key findings include:
+
+- ✅ Reported throughput increases with concurrency up to 64 at `max_tokens = 128`, with a plateau between concurrency 16 and 24
+- ✅ Peak performance of ~819 tokens/sec under `max_tokens = 128`
+- ✅ No obvious throughput collapse observed within tested range
+- ✅ Latency increases gradually with concurrency
+- ✅ Stable and balanced utilization across all QAIC devices
+
+
+💡 *Note: This benchmark was performed using `meta-llama/Llama-3.2-3B-Instruct`.
+Results may vary if different models (e.g., Qwen) are used due to architectural and decoding differences.*
+
+---
+
+## 🔧Customization
+
+This benchmark can be easily adapted to evaluate different models or configurations.
+
+To run tests with your own model, modify the following parameters in the script:
+
+- `MODEL_NAME`
+- `MAX_TOKENS_LIST`
+- `CONCURRENCY_LEVELS`
+
+For this example, the benchmark was performed using:
+```python
+MODEL_NAME = "meta-llama/Llama-3.2-3B-Instruct"
+```
+The model can be replaced with any supported LLM (e.g., other Hugging Face models) to evaluate performance under different workloads. The name must match the `--model` value the vLLM servers were started with.
\ No newline at end of file
diff --git a/GenAI/CloudAI-Playground/vLLM_QAIC_Concurrency_Benchmark_on_AIC100/benchmark_multi_server_full_multitokens.py b/GenAI/CloudAI-Playground/vLLM_QAIC_Concurrency_Benchmark_on_AIC100/benchmark_multi_server_full_multitokens.py
new file mode 100644
index 0000000..7120801
--- /dev/null
+++ b/GenAI/CloudAI-Playground/vLLM_QAIC_Concurrency_Benchmark_on_AIC100/benchmark_multi_server_full_multitokens.py
@@ -0,0 +1,405 @@
+#===-- benchmark_multi_server_full_multitokens.py ------------------------===//
+# Part of the Startup-Demos Project, under the MIT License
+# See https://github.com/qualcomm/Startup-Demos/blob/main/LICENSE.txt
+# for license information.
+# Copyright (c) Qualcomm Technologies, Inc. and/or its subsidiaries.
+# SPDX-License-Identifier: MIT
+#===----------------------------------------------------------------------===//
+
+import asyncio
+import aiohttp
+import time
+import csv
+import os
+from datetime import datetime
+import matplotlib.pyplot as plt
+
+# =========================
+# Config
+# =========================
+ENDPOINTS = [
+    "http://localhost:8000/v1/chat/completions",
+    "http://localhost:8001/v1/chat/completions",
+    "http://localhost:8002/v1/chat/completions",
+    "http://localhost:8003/v1/chat/completions",
+]
+
+MODEL_NAME = "meta-llama/Llama-3.2-3B-Instruct"
+PROMPT = "Explain what AI is in simple terms."
+TEMPERATURE = 0.7
+
+# Sweep decode workload
+MAX_TOKENS_LIST = [32, 128, 512]
+
+# Sweep concurrency
+CONCURRENCY_LEVELS = [1, 4, 8, 12, 16, 24, 32, 40, 48, 56, 64]
+
+# Measurement windows
+DURATION_S = 20
+WARMUP_S = 3
+REPEATS_PER_LEVEL = 3
+REQUEST_TIMEOUT_S = 60
+
+# Metrics
+ENABLE_TOKEN_METRICS = True
+
+# Output
+RUN_TAG = datetime.now().strftime("%Y%m%d_%H%M%S")
+OUT_DIR = "results"
+
+
+# =========================
+# Payload builder
+# =========================
+def build_payload(max_tokens: int):
+    return {
+        "model": MODEL_NAME,
+        "messages": [{"role": "user", "content": PROMPT}],
+        "max_tokens": max_tokens,
+        "temperature": TEMPERATURE,
+    }
+
+
+# =========================
+# One request (E2E latency)
+# =========================
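+async def send_one(session, url, payload):
+    """Send one request and return (ok, elapsed_s, completion_tokens)."""
+    start = time.perf_counter()
+    # aiohttp documents ClientTimeout as the timeout type; bare numbers are legacy
+    timeout = aiohttp.ClientTimeout(total=REQUEST_TIMEOUT_S)
+    try:
+        async with session.post(url, json=payload, timeout=timeout) as resp:
+            # Check the status before parsing: error bodies may not be JSON
+            if resp.status < 200 or resp.status >= 300:
+                return False, time.perf_counter() - start, None
+
+            data = await resp.json()
+            elapsed = time.perf_counter() - start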
+
+            completion_tokens = None
+            if ENABLE_TOKEN_METRICS:
+                usage = data.get("usage", {})
+                if isinstance(usage, dict):
+                    completion_tokens = usage.get("completion_tokens", None)
+
+            return True, elapsed, completion_tokens
+
+    except Exception:
+        return False, None, None
+
+
+# =========================
+# Worker: fixed endpoint pinning (worker_id % N)
+# =========================
+async def worker(worker_id, session, end_t, warmup_end_t, per_worker_stats, payload):
+    n = len(ENDPOINTS)
+    url = ENDPOINTS[worker_id % n]
+    ep_idx = worker_id % n
+
+    local_success = 0
+    local_fail = 0
+    local_lat = []
+    local_tokens = 0
+
+    while True:
+        now = time.perf_counter()
+        if now >= end_t:
+            break
+
+        ok, latency, comp_tokens = await send_one(session, url, payload)
+        done_t = time.perf_counter()
+
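+        # Count only requests that complete inside the measurement window
+        # (after warmup, before end_t) so the totals match the DURATION_S divisor
+        if done_t < warmup_end_t or done_t > end_t:
+            continue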
+
+        if ok:
+            local_success += 1
+            if latency is not None:
+                local_lat.append(latency)
+            if comp_tokens is not None:
+                local_tokens += comp_tokens
+        else:
+            local_fail += 1
+
+    per_worker_stats.append((ep_idx, local_success, local_fail, local_lat, local_tokens))
+
+
+# =========================
+# Run one level (one concurrency, one max_tokens) for fixed duration
+# =========================
+async def run_level(concurrency, max_tokens):
+    payload = build_payload(max_tokens)
+
+    timeout = aiohttp.ClientTimeout(total=REQUEST_TIMEOUT_S + 5)
+    async with aiohttp.ClientSession(timeout=timeout) as session:
+        t0 = time.perf_counter()
+        warmup_end = t0 + WARMUP_S
+        end_t = t0 + WARMUP_S + DURATION_S
+
+        per_worker_stats = []
+        tasks = [
+            worker(i, session, end_t, warmup_end, per_worker_stats, payload)
+            for i in range(concurrency)
+        ]
+        await asyncio.gather(*tasks)
+
+    # Aggregate per-endpoint
+    n = len(ENDPOINTS)
+    ep_success = [0] * n
+    ep_fail = [0] * n
+    ep_latencies = [[] for _ in range(n)]
+    ep_tokens = [0] * n
+
+    for ep_idx, s, f, lats, toks in per_worker_stats:
+        ep_success[ep_idx] += s
+        ep_fail[ep_idx] += f
+        ep_latencies[ep_idx].extend(lats)
+        ep_tokens[ep_idx] += toks
+
+    total_success = sum(ep_success)
+    total_fail = sum(ep_fail)
+    all_lat = [x for sub in ep_latencies for x in sub]
+
+    # Throughput over measured window (exclude warmup)
+    tps = total_success / DURATION_S if DURATION_S > 0 else 0.0
+    tok_per_s = (sum(ep_tokens) / DURATION_S) if (ENABLE_TOKEN_METRICS and DURATION_S > 0) else None
+
+    avg_lat = (sum(all_lat) / len(all_lat)) if all_lat else None
+    max_lat = max(all_lat) if all_lat else None
+
+    ep_tps = [s / DURATION_S for s in ep_success]
+    ep_tokps = [(t / DURATION_S) for t in ep_tokens] if ENABLE_TOKEN_METRICS else [None] * n
+    ep_avg_lat = [(sum(lats) / len(lats) if lats else None) for lats in ep_latencies]
+
+    return {
+        "concurrency": concurrency,
+        "max_tokens": max_tokens,
+
+        "total_success": total_success,
+        "total_fail": total_fail,
+
+        "tps_req_per_s": tps,
+        "tok_per_s": tok_per_s,
+
+        "avg_latency_s": avg_lat,
+        "max_latency_s": max_lat,
+
+        "ep_success": ep_success,
+        "ep_fail": ep_fail,
+        "ep_tps": ep_tps,
+        "ep_tokps": ep_tokps,
+        "ep_avg_lat": ep_avg_lat,
+    }
+
+
+# =========================
+# Repeat & aggregate
+# =========================
+def mean(vals):
+    vals = [v for v in vals if v is not None]
+    return sum(vals) / len(vals) if vals else None
+
+
+async def run_repeats(concurrency, repeats, max_tokens):
+    runs = []
+    for _ in range(repeats):
+        runs.append(await run_level(concurrency, max_tokens))
+    return runs
+
+
+def aggregate_runs(runs):
+    n = len(ENDPOINTS)
+    return {
+        "concurrency": runs[0]["concurrency"],
+        "max_tokens": runs[0]["max_tokens"],
+
+        "tps_req_per_s_mean": mean([r["tps_req_per_s"] for r in runs]),
+        "tok_per_s_mean": mean([r["tok_per_s"] for r in runs]),
+        "avg_latency_s_mean": mean([r["avg_latency_s"] for r in runs]),
+        "max_latency_s_mean": mean([r["max_latency_s"] for r in runs]),
+
+        **{f"ep{i}_tps_mean": mean([r["ep_tps"][i] for r in runs]) for i in range(n)},
+        **{f"ep{i}_tokps_mean": mean([r["ep_tokps"][i] for r in runs]) for i in range(n)},
+        **{f"ep{i}_avg_lat_mean": mean([r["ep_avg_lat"][i] for r in runs]) for i in range(n)},
+        **{f"ep{i}_fail_mean": mean([r["ep_fail"][i] for r in runs]) for i in range(n)},
+
+        "total_fail_mean": mean([r["total_fail"] for r in runs]),
+        "total_success_mean": mean([r["total_success"] for r in runs]),
+    }
+
+
+# =========================
+# IO & plotting
+# =========================
+def safe_format(x, digits=2):
+    return f"{x:.{digits}f}" if x is not None else "N/A"
+
+
+def save_csv(rows, path):
+    os.makedirs(os.path.dirname(path), exist_ok=True)
+    with open(path, "w", newline="", encoding="utf-8") as f:
+        w = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
+        w.writeheader()
+        w.writerows(rows)
+
+
+def plot_overall_tps(rows, out_png, title_suffix=""):
+    conc = [r["concurrency"] for r in rows]
+    tps = [r["tps_req_per_s_mean"] for r in rows]
+    plt.figure()
+    plt.plot(conc, tps, marker="o")
+    plt.xlabel("Concurrency (total across 4 servers)")
+    plt.ylabel("TPS (requests/sec)")
+    plt.title(f"Aggregate TPS vs Concurrency (4 servers) - {MODEL_NAME}{title_suffix}")
+    plt.grid(True, linestyle="--", alpha=0.6)
+    os.makedirs(os.path.dirname(out_png), exist_ok=True)
+    plt.tight_layout(rect=[0, 0, 1, 0.95])
+    plt.savefig(out_png, dpi=150, bbox_inches="tight", pad_inches=0.2)
+    plt.close()
+
+
+def plot_latency(rows, out_png, title_suffix=""):
+    conc = [r["concurrency"] for r in rows]
+    avg_lat = [r["avg_latency_s_mean"] for r in rows]
+    max_lat = [r["max_latency_s_mean"] for r in rows]
+    plt.figure()
+    plt.plot(conc, avg_lat, marker="o", label="Avg latency")
+    plt.plot(conc, max_lat, marker="x", label="Max latency")
+    plt.xlabel("Concurrency (total across 4 servers)")
+    plt.ylabel("Latency (sec)")
+    plt.title(f"Latency vs Concurrency (4 servers) - {MODEL_NAME}{title_suffix}")
+    plt.legend()
+    plt.grid(True, linestyle="--", alpha=0.6)
+    os.makedirs(os.path.dirname(out_png), exist_ok=True)
+    plt.tight_layout(rect=[0, 0, 1, 0.95])
+    plt.savefig(out_png, dpi=150, bbox_inches="tight", pad_inches=0.2)
+    plt.close()
+
+
+def plot_per_endpoint_tps(rows, out_png, title_suffix="", jitter_x=True):
+    """
+    jitter_x=True will shift x slightly per endpoint so overlapping lines are visible.
+    """
+    conc = [r["concurrency"] for r in rows]
+    plt.figure()
+
+    offsets = [-0.30, -0.10, 0.10, 0.30]
+    styles = ["-", "--", "-.", ":"]
+
+    for i in range(len(ENDPOINTS)):
+        tps_i = [r[f"ep{i}_tps_mean"] for r in rows]
+        x_i = [(c + offsets[i]) for c in conc] if jitter_x else conc
+        plt.plot(x_i, tps_i, marker="o", linestyle=styles[i], linewidth=2, label=f"endpoint {i}")
+
+    plt.xlabel("Concurrency (total across 4 servers)")
+    plt.ylabel("TPS per endpoint (req/s)")
+    plt.title(f"Per-endpoint TPS vs Concurrency - {MODEL_NAME}{title_suffix}")
+    plt.legend()
+    plt.grid(True, linestyle="--", alpha=0.6)
+    os.makedirs(os.path.dirname(out_png), exist_ok=True)
+    plt.tight_layout(rect=[0, 0, 1, 0.95])
+    plt.savefig(out_png, dpi=150, bbox_inches="tight", pad_inches=0.2)
+    plt.close()
+
+
+def plot_multi_tokens_curves(all_rows_by_tokens, out_png, y_key, y_label, title):
+    """
+    Draw multiple lines (one per max_tokens) on the same chart.
+    y_key: e.g. "tps_req_per_s_mean" or "tok_per_s_mean"
+    """
+    plt.figure()
+    for max_tokens, rows in all_rows_by_tokens.items():
+        conc = [r["concurrency"] for r in rows]
+        y = [r[y_key] for r in rows]
+        plt.plot(conc, y, marker="o", label=f"max_tokens={max_tokens}")
+
+    plt.xlabel("Concurrency (total across 4 servers)")
+    plt.ylabel(y_label)
+    plt.title(title)
+    plt.legend()
+    plt.grid(True, linestyle="--", alpha=0.6)
+    os.makedirs(os.path.dirname(out_png), exist_ok=True)
+    plt.tight_layout(rect=[0, 0, 1, 0.95])
+    plt.savefig(out_png, dpi=150, bbox_inches="tight", pad_inches=0.2)
+    plt.close()
+
+
+# =========================
+# Main
+# =========================
+async def main():
+    print("Endpoints:")
+    for e in ENDPOINTS:
+        print("  ", e)
+    print(f"\nModel={MODEL_NAME}, duration={DURATION_S}s, warmup={WARMUP_S}s, repeats={REPEATS_PER_LEVEL}")
+    print(f"MAX_TOKENS_LIST={MAX_TOKENS_LIST}\n")
+
+    os.makedirs(OUT_DIR, exist_ok=True)
+
+    # Collect results for final multi-curve plots
+    all_rows_by_tokens = {}
+
+    for max_tokens in MAX_TOKENS_LIST:
+        print("\n==============================")
+        print(f" MAX_TOKENS = {max_tokens}")
+        print("==============================\n")
+
+        agg_rows = []
+
+        for c in CONCURRENCY_LEVELS:
+            runs = await run_repeats(c, REPEATS_PER_LEVEL, max_tokens)
+            agg = aggregate_runs(runs)
+            agg_rows.append(agg)
+
+            print(f"=== Concurrency {c} ===")
+            print(f"Aggregate TPS(mean): {safe_format(agg['tps_req_per_s_mean'])} req/s")
+            print(f"Tokens/s(mean):      {safe_format(agg['tok_per_s_mean'])}")
+            print(f"Avg latency(mean):   {safe_format(agg['avg_latency_s_mean'])} s")
+            print(f"Max latency(mean):   {safe_format(agg['max_latency_s_mean'])} s")
+            print(f"Total fail(mean):    {safe_format(agg['total_fail_mean'])}")
+            print("")
+
+        all_rows_by_tokens[max_tokens] = agg_rows
+
+        # Per max_tokens outputs
+        csv_path = f"{OUT_DIR}/multi_results_tok{max_tokens}_{RUN_TAG}.csv"
+        tps_png = f"{OUT_DIR}/multi_tps_tok{max_tokens}_{RUN_TAG}.png"
+        lat_png = f"{OUT_DIR}/multi_latency_tok{max_tokens}_{RUN_TAG}.png"
+        per_png = f"{OUT_DIR}/per_endpoint_tps_tok{max_tokens}_{RUN_TAG}.png"
+
+        save_csv(agg_rows, csv_path)
+        plot_overall_tps(agg_rows, tps_png, title_suffix=f" (max_tokens={max_tokens})")
+        plot_latency(agg_rows, lat_png, title_suffix=f" (max_tokens={max_tokens})")
+        plot_per_endpoint_tps(agg_rows, per_png, title_suffix=f" (max_tokens={max_tokens})", jitter_x=True)
+
+        print("Saved:")
+        print("  CSV :", csv_path)
+        print("  TPS :", tps_png)
+        print("  LAT :", lat_png)
+        print("  PER :", per_png)
+
+    # Final multi-curve summary plots (compare max_tokens)
+    multi_tps_png = f"{OUT_DIR}/compare_tps_multi_tokens_{RUN_TAG}.png"
+    multi_tokps_png = f"{OUT_DIR}/compare_tokps_multi_tokens_{RUN_TAG}.png"
+
+    plot_multi_tokens_curves(
+        all_rows_by_tokens,
+        multi_tps_png,
+        y_key="tps_req_per_s_mean",
+        y_label="TPS (requests/sec)",
+        title=f"Aggregate TPS vs Concurrency (compare max_tokens) - {MODEL_NAME}"
+    )
+
+    plot_multi_tokens_curves(
+        all_rows_by_tokens,
+        multi_tokps_png,
+        y_key="tok_per_s_mean",
+        y_label="Tokens/sec",
+        title=f"Tokens/sec vs Concurrency (compare max_tokens) - {MODEL_NAME}"
+    )
+
+    print("\nSaved multi-curve comparison plots:")
+    print("  TPS compare :", multi_tps_png)
+    print("  Tok/s compare:", multi_tokps_png)
+
+
+if __name__ == "__main__":
+    asyncio.run(main())
\ No newline at end of file
diff --git a/GenAI/CloudAI-Playground/vLLM_QAIC_Concurrency_Benchmark_on_AIC100/images/docker-images-show.png b/GenAI/CloudAI-Playground/vLLM_QAIC_Concurrency_Benchmark_on_AIC100/images/docker-images-show.png
new file mode 100644
index 0000000..d9c285c
Binary files /dev/null and b/GenAI/CloudAI-Playground/vLLM_QAIC_Concurrency_Benchmark_on_AIC100/images/docker-images-show.png differ
diff --git a/GenAI/CloudAI-Playground/vLLM_QAIC_Concurrency_Benchmark_on_AIC100/images/huggingface_cli_login.png b/GenAI/CloudAI-Playground/vLLM_QAIC_Concurrency_Benchmark_on_AIC100/images/huggingface_cli_login.png
new file mode 100644
index 0000000..5ec0507
Binary files /dev/null and b/GenAI/CloudAI-Playground/vLLM_QAIC_Concurrency_Benchmark_on_AIC100/images/huggingface_cli_login.png differ
diff --git a/GenAI/CloudAI-Playground/vLLM_QAIC_Concurrency_Benchmark_on_AIC100/images/multi_tps_tok128.png b/GenAI/CloudAI-Playground/vLLM_QAIC_Concurrency_Benchmark_on_AIC100/images/multi_tps_tok128.png
new file mode 100644
index 0000000..587a424
Binary files /dev/null and b/GenAI/CloudAI-Playground/vLLM_QAIC_Concurrency_Benchmark_on_AIC100/images/multi_tps_tok128.png differ
diff --git a/GenAI/CloudAI-Playground/vLLM_QAIC_Concurrency_Benchmark_on_AIC100/images/per_endpoint_tps_tok128.png b/GenAI/CloudAI-Playground/vLLM_QAIC_Concurrency_Benchmark_on_AIC100/images/per_endpoint_tps_tok128.png
new file mode 100644
index 0000000..c2f6c36
Binary files /dev/null and b/GenAI/CloudAI-Playground/vLLM_QAIC_Concurrency_Benchmark_on_AIC100/images/per_endpoint_tps_tok128.png differ
diff --git a/GenAI/CloudAI-Playground/vLLM_QAIC_Concurrency_Benchmark_on_AIC100/images/qaic-devices-status.png b/GenAI/CloudAI-Playground/vLLM_QAIC_Concurrency_Benchmark_on_AIC100/images/qaic-devices-status.png
new file mode 100644
index 0000000..1e4bc7d
Binary files /dev/null and b/GenAI/CloudAI-Playground/vLLM_QAIC_Concurrency_Benchmark_on_AIC100/images/qaic-devices-status.png differ
diff --git a/GenAI/CloudAI-Playground/vLLM_QAIC_Concurrency_Benchmark_on_AIC100/images/server-start-log.png b/GenAI/CloudAI-Playground/vLLM_QAIC_Concurrency_Benchmark_on_AIC100/images/server-start-log.png
new file mode 100644
index 0000000..a02beb5
Binary files /dev/null and b/GenAI/CloudAI-Playground/vLLM_QAIC_Concurrency_Benchmark_on_AIC100/images/server-start-log.png differ
diff --git a/GenAI/CloudAI-Playground/vLLM_QAIC_Concurrency_Benchmark_on_AIC100/project.yaml b/GenAI/CloudAI-Playground/vLLM_QAIC_Concurrency_Benchmark_on_AIC100/project.yaml
new file mode 100644
index 0000000..18eb795
--- /dev/null
+++ b/GenAI/CloudAI-Playground/vLLM_QAIC_Concurrency_Benchmark_on_AIC100/project.yaml
@@ -0,0 +1,11 @@
+name: "vLLM QAIC Concurrency Benchmark on AIC100"
+description: "Provides a reference benchmark for evaluating vLLM serving performance on Qualcomm AIC100 (QAIC), focusing on concurrency scaling, throughput, and multi-request LLM inference behavior."
+category:
+  - "GenAI"
+  - "CloudAI-Playground"
+platforms:
+  - "AIC100 Ultra"
+tags:
+  - "vLLM"
+  - "LLM Benchmark"
+  - "QAIC"
\ No newline at end of file
diff --git a/README.md b/README.md
index ebc5ec6..6f68208 100644
--- a/README.md
+++ b/README.md
@@ -191,6 +191,9 @@ This repository is a collection of demo applications that highlight the capabili
       </a><br>
       <a href="./GenAI/CloudAI-Playground/traffic_ai_agent/" title="An on‑prem traffic AI demo with real‑time decisions and LLM explanations on Cloud AI 100 Ultra.">
         <img src="https://img.shields.io/badge/Traffic_AI_Agent_–_Dual_Intersection_(On‑Prem,_Analysis)-2026.05.04-grey?style=flat-square&labelColor=brightgreen" alt="Traffic AI Agent – Dual Intersection (On‑Prem, Analysis)"/>
+      </a><br>
+      <a href="./GenAI/CloudAI-Playground/vLLM_QAIC_Concurrency_Benchmark_on_AIC100/" title="Provides a reference benchmark for evaluating vLLM serving performance on Qualcomm AIC100 (QAIC), focusing on concurrency scaling, throughput, and multi-request LLM inference behavior.">
+        <img src="https://img.shields.io/badge/vLLM_QAIC_Concurrency_Benchmark_on_AIC100-2026.06.04-grey?style=flat-square&labelColor=brightgreen" alt="vLLM QAIC Concurrency Benchmark on AIC100"/>
       </a>
     </td>
     <td align="center">—</td>