
Conversation

@alizaidis
Collaborator

Added initial examples for vLLM speculative decoding with the n-gram and EAGLE methods. We can update with MLP speculators and benchmarks against base vLLM for each config once we have the measurement flows finalized.

Comment on lines +37 to +57
containers:
- args:
- |
echo "########### $(date) - Starting parallel-fetch-safetensors for model: ${MODEL_ID}"
ls -alR /gcs
find /gcs/${MODEL_ID}/*safetensors -type f | xargs -I {} -P 15 sh -c 'echo "########### $(date) - Fetching: {}"; dd if={} of=/dev/null'
echo "########### $(date) - Finished parallel-fetch-safetensors"
sleep infinity
command: ["/bin/sh", "-c"]
env:
- name: MODEL_ID
valueFrom:
configMapKeyRef:
key: MODEL_ID
name: runtime
image: busybox
name: fetch-safetensors
volumeMounts:
- mountPath: /gcs
name: huggingface-hub-model-bucket
readOnly: true
Member

I think that we're now recommending using Run:ai Model Streamer in the best practices doc.

Collaborator

Until we've tested and verified runai-model-streamer, I don't think we can make a hard recommendation on just using it. It currently supports primarily vLLM, so some customers may look for a solution that can be used across inference servers.
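
For reference, if we do adopt it later, enabling it in vLLM would look roughly like the following. This is an untested sketch that assumes the existing gcsfuse mount at /gcs stays in place; the concurrency value is illustrative, not tuned.

```shell
# Sketch only: stream weights with Run:ai Model Streamer instead of warming
# the page cache with the parallel-fetch init container.
vllm serve /gcs/${MODEL_ID} \
  --load-format runai_streamer \
  --model-loader-extra-config '{"concurrency": 16}'
```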

driver: gcsfuse.csi.storage.gke.io
volumeAttributes:
bucketName: cloud-storage-bucket-name
mountOptions: metadata-cache:ttl-secs:-1,metadata-cache:stat-cache-max-size-mb:-1,metadata-cache:type-cache-max-size-mb:-1,metadata-cache:negative-ttl-secs:0,file-cache:max-size-mb:-1,file-cache:cache-file-for-range-read:true,file-cache:enable-parallel-downloads:true,implicit-dirs,file-system:kernel-list-cache-ttl-secs:-1
Member

Need to add only-dir?

Collaborator

In the current implementation, two models are needed. So either two separate mounts would need to be created or we'd need to leave off only-dir.
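
For example, scoping each model to its own mount could look something like this (a sketch only; the volume names and directory paths are placeholders, not values from this repo):

```yaml
# Sketch: one gcsfuse volume per model, each scoped with only-dir.
volumes:
- name: target-model
  csi:
    driver: gcsfuse.csi.storage.gke.io
    volumeAttributes:
      bucketName: cloud-storage-bucket-name
      mountOptions: only-dir:target-model-dir,implicit-dirs
- name: draft-model
  csi:
    driver: gcsfuse.csi.storage.gke.io
    volumeAttributes:
      bucketName: cloud-storage-bucket-name
      mountOptions: only-dir:draft-model-dir,implicit-dirs
```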

@ferrarimarco
Member

@alizaidis I didn't complete a full test round yet. @syeda-anjum do you have time for that? Thanks

Collaborator

@arueth arueth left a comment

Can you also rebase from main and fix the PR checks?

Speculative decoding is a powerful optimization technique that enhances LLM inference speed without compromising output quality. It utilizes a smaller, faster "draft" model or method to generate candidate tokens, which are then validated by the main, larger "target" model in a single, efficient step. This reduces the computational overhead and improves both throughput and inter-token latency.
vLLM supports several speculative decoding methods, each tailored to different use cases and performance requirements. See the [Speculative Decoding guide](https://docs.vllm.ai/en/v0.11.0/features/spec_decode.html) in the official vLLM docs for in-depth concepts and examples. This guide walks you through implementing the following speculative decoding methods with vLLM on GKE:

- [N-gram Based Speculative Decoding](https://docs.vllm.ai/en/v0.11.0/features/spec_decode.html#speculating-by-matching-n-grams-in-the-prompt)
Collaborator

@arueth arueth Nov 13, 2025

Do you want to pin to v0.11.0 of the docs or use stable?


This method is particularly effective for tasks where the output is likely to contain sequences from the input prompt, such as summarization or question-answering. Instead of a draft model, it uses n-grams from the prompt to generate token proposals.
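
As an illustration, the n-gram method needs only the target model plus a speculative config. A minimal sketch of the serving flags, assuming the `--speculative-config` JSON argument documented for v0.11.0 (the token counts are illustrative, not tuned values):

```shell
vllm serve google/gemma-3-27b-it \
  --speculative-config '{"method": "ngram", "num_speculative_tokens": 5, "prompt_lookup_max": 4}'
```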

- [EAGLE Based Draft Models](https://docs.vllm.ai/en/v0.11.0/features/spec_decode.html#speculating-using-eagle-based-draft-models)
Collaborator

Do you want to pin to v0.11.0 of the docs or use latest?
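
For the EAGLE-based method linked above, the speculative config points at a separate draft model, which is why two models (and two mounts) are involved. A rough sketch, using the Llama 3 8B EAGLE head from the vLLM docs rather than the 70B pairing used in these manifests:

```shell
vllm serve meta-llama/Meta-Llama-3-8B-Instruct \
  --speculative-config '{"method": "eagle", "model": "yuhuili/EAGLE-LLaMA3-Instruct-8B", "num_speculative_tokens": 3}'
```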

@@ -0,0 +1 @@
CONTAINER_IMAGE_URL=docker.io/vllm/vllm-openai:v0.11.0
Collaborator

This file shouldn't be committed.

[Allocation quotas: GPU quota](https://cloud.google.com/compute/resource-usage#gpu_quota).

The Kubernetes manifests invoked below are based on the
[Inference Quickstart recommendations](https://cloud.google.com/kubernetes-engine/docs/how-to/machine-learning/inference-quickstart).
Collaborator

Is this true?

Comment on lines +147 to +148
| gemma-3-27b-it | ✅ | ✅ |
| llama-3.3-70b-instruct | ✅ | ✅ |
Collaborator

There are no h200 manifests.

Start a port forward to the model service.

```shell
kubectl --namespace=${ira_online_gpu_kubernetes_namespace_name} port-forward service/vllm-h100-llama-3-70b-it-sd-eagle 8000:8000 >/dev/null & \
Collaborator

This probably needs to be parameterized for model and accelerator.
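
For example, a hypothetical parameterized form (the ${ACCELERATOR}, ${MODEL_NAME}, and ${SD_METHOD} variables are illustrative and not defined in the repo):

```shell
kubectl --namespace=${ira_online_gpu_kubernetes_namespace_name} \
  port-forward service/vllm-${ACCELERATOR}-${MODEL_NAME}-sd-${SD_METHOD} 8000:8000 >/dev/null &
```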

- Delete the workload.

```shell
kubectl delete --ignore-not-found --kustomize "${ACP_REPO_DIR}/platforms/gke/base/use-cases/inference-ref-arch/kubernetes-manifests/online-inference-gpu/vllm-spec-decoding/h100-llama-3-70b-it-sd-eagle"
Collaborator

This probably needs to be parameterized for model and accelerator.


nameSuffix: -h100-gemma-3-27b-it-sd-ngram


Collaborator

Extra newline

name: vllm
path: patch-spec-decoding.yaml


Collaborator

Extra newline

template:
spec:
nodeSelector:
cloud.google.com/compute-class: gpu-h100-80gb-high-x1
Collaborator

Is one h100 sufficient?
