
Conversation

@alizaidis
Collaborator

Added initial examples for vLLM speculative decoding with the n-gram and EAGLE methods. We can update with MLP speculators and benchmarks against base vLLM for each config once we have the measurement flows finalized.

Comment on lines +37 to +57
containers:
- args:
- |
echo "########### $(date) - Starting parallel-fetch-safetensors for model: ${MODEL_ID}"
ls -alR /gcs
find /gcs/${MODEL_ID}/*safetensors -type f | xargs -I {} -P 15 sh -c 'echo "########### $(date) - Fetching: {}"; dd if={} of=/dev/null'
echo "########### $(date) - Finished parallel-fetch-safetensors"
sleep infinity
command: ["/bin/sh", "-c"]
env:
- name: MODEL_ID
valueFrom:
configMapKeyRef:
key: MODEL_ID
name: runtime
image: busybox
name: fetch-safetensors
volumeMounts:
- mountPath: /gcs
name: huggingface-hub-model-bucket
readOnly: true
Member

I think that we're now recommending using Run:ai Model Streamer in the best practices doc.

Collaborator

Until we've tested and verified runai-model-streamer, I don't think we can make a hard recommendation on just using it. It currently supports primarily vLLM, so some customers may look for a solution that can be used across inference servers.
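
For reference, if we do adopt it later, enabling it in vLLM would look roughly like the following. This is an untested sketch that assumes the existing gcsfuse mount at /gcs stays in place; the concurrency value is illustrative, not tuned.

```shell
# Sketch only: stream weights with Run:ai Model Streamer instead of warming
# the page cache with the parallel-fetch init container.
vllm serve /gcs/${MODEL_ID} \
  --load-format runai_streamer \
  --model-loader-extra-config '{"concurrency": 16}'
```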

driver: gcsfuse.csi.storage.gke.io
volumeAttributes:
bucketName: cloud-storage-bucket-name
mountOptions: metadata-cache:ttl-secs:-1,metadata-cache:stat-cache-max-size-mb:-1,metadata-cache:type-cache-max-size-mb:-1,metadata-cache:negative-ttl-secs:0,file-cache:max-size-mb:-1,file-cache:cache-file-for-range-read:true,file-cache:enable-parallel-downloads:true,implicit-dirs,file-system:kernel-list-cache-ttl-secs:-1
Member

Need to add only-dir?

Collaborator

In the current implementation, two models are needed. So either two separate mounts would need to be created or we'd need to leave off only-dir.
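
For example, scoping each model to its own mount could look something like this (a sketch only; the volume names and directory paths are placeholders, not values from this repo):

```yaml
# Sketch: one gcsfuse volume per model, each scoped with only-dir.
volumes:
- name: target-model
  csi:
    driver: gcsfuse.csi.storage.gke.io
    volumeAttributes:
      bucketName: cloud-storage-bucket-name
      mountOptions: only-dir:target-model-dir,implicit-dirs
- name: draft-model
  csi:
    driver: gcsfuse.csi.storage.gke.io
    volumeAttributes:
      bucketName: cloud-storage-bucket-name
      mountOptions: only-dir:draft-model-dir,implicit-dirs
```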

@ferrarimarco
Member

@alizaidis I didn't complete a full test round yet. @syeda-anjum do you have time for that? Thanks

Collaborator

@arueth arueth left a comment

Can you also rebase from main and fix the PR checks?

Speculative decoding is a powerful optimization technique that enhances LLM inference speed without compromising output quality. It utilizes a smaller, faster "draft" model or method to generate candidate tokens, which are then validated by the main, larger "target" model in a single, efficient step. This reduces the computational overhead and improves both throughput and inter-token latency.
vLLM supports several speculative decoding methods, each tailored to different use cases and performance requirements. See the [Speculative Decoding guide](https://docs.vllm.ai/en/v0.11.0/features/spec_decode.html) in the official vLLM docs for in-depth concepts and examples. This guide walks you through implementing the following speculative decoding methods with vLLM on GKE:

- [N-gram Based Speculative Decoding](https://docs.vllm.ai/en/v0.11.0/features/spec_decode.html#speculating-by-matching-n-grams-in-the-prompt)
Collaborator

@arueth arueth Nov 13, 2025

Do you want to pin to v0.11.0 of the docs or use stable?


This method is particularly effective for tasks where the output is likely to contain sequences from the input prompt, such as summarization or question-answering. Instead of a draft model, it uses n-grams from the prompt to generate token proposals.
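
As an illustration, the n-gram method needs only the target model plus a speculative config. A minimal sketch of the serving flags, assuming the `--speculative-config` JSON argument documented for v0.11.0 (the token counts are illustrative, not tuned values):

```shell
vllm serve google/gemma-3-27b-it \
  --speculative-config '{"method": "ngram", "num_speculative_tokens": 5, "prompt_lookup_max": 4}'
```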

- [EAGLE Based Draft Models](https://docs.vllm.ai/en/v0.11.0/features/spec_decode.html#speculating-using-eagle-based-draft-models)
Collaborator

Do you want to pin to v0.11.0 of the docs or use latest?
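
For the EAGLE-based method linked above, the speculative config points at a separate draft model, which is why two models (and two mounts) are involved. A rough sketch, using the Llama 3 8B EAGLE head from the vLLM docs rather than the 70B pairing used in these manifests:

```shell
vllm serve meta-llama/Meta-Llama-3-8B-Instruct \
  --speculative-config '{"method": "eagle", "model": "yuhuili/EAGLE-LLaMA3-Instruct-8B", "num_speculative_tokens": 3}'
```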

@@ -0,0 +1 @@
CONTAINER_IMAGE_URL=docker.io/vllm/vllm-openai:v0.11.0
Collaborator

This file shouldn't be committed.

[Allocation quotas: GPU quota](https://cloud.google.com/compute/resource-usage#gpu_quota).

The Kubernetes manifests invoked below are based on the
[Inference Quickstart recommendations](https://cloud.google.com/kubernetes-engine/docs/how-to/machine-learning/inference-quickstart).
Collaborator

Is this true?

Comment on lines +147 to +148
| gemma-3-27b-it | ✅ | ✅ |
| llama-3.3-70b-instruct | ✅ | ✅ |
Collaborator

There are no h200 manifests.

Start a port forward to the model service.

```shell
kubectl --namespace=${ira_online_gpu_kubernetes_namespace_name} port-forward service/vllm-h100-llama-3-70b-it-sd-eagle 8000:8000 >/dev/null & \
Collaborator

This probably needs to be parameterized for model and accelerator.
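
For example, a hypothetical parameterized form (the ${ACCELERATOR}, ${MODEL_NAME}, and ${SD_METHOD} variables are illustrative and not defined in the repo):

```shell
kubectl --namespace=${ira_online_gpu_kubernetes_namespace_name} \
  port-forward service/vllm-${ACCELERATOR}-${MODEL_NAME}-sd-${SD_METHOD} 8000:8000 >/dev/null &
```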

- Delete the workload.

```shell
kubectl delete --ignore-not-found --kustomize "${ACP_REPO_DIR}/platforms/gke/base/use-cases/inference-ref-arch/kubernetes-manifests/online-inference-gpu/vllm-spec-decoding/h100-llama-3-70b-it-sd-eagle"
Collaborator

This probably needs to be parameterized for model and accelerator.


nameSuffix: -h100-gemma-3-27b-it-sd-ngram


Collaborator

Extra newline

name: vllm
path: patch-spec-decoding.yaml


Collaborator

Extra newline

template:
spec:
nodeSelector:
cloud.google.com/compute-class: gpu-h100-80gb-high-x1
Collaborator

Is one h100 sufficient?
