Added examples for vllm speculative decoding with n-gram and eagle methods #332
base: main
Conversation
```yaml
containers:
- args:
  - |
    echo "########### $(date) - Starting parallel-fetch-safetensors for model: ${MODEL_ID}"
    ls -alR /gcs
    find /gcs/${MODEL_ID}/*safetensors -type f | xargs -I {} -P 15 sh -c 'echo "########### $(date) - Fetching: {}"; dd if={} of=/dev/null'
    echo "########### $(date) - Finished parallel-fetch-safetensors"
    sleep infinity
  command: ["/bin/sh", "-c"]
  env:
  - name: MODEL_ID
    valueFrom:
      configMapKeyRef:
        key: MODEL_ID
        name: runtime
  image: busybox
  name: fetch-safetensors
  volumeMounts:
  - mountPath: /gcs
    name: huggingface-hub-model-bucket
    readOnly: true
```
I think that we're now recommending using Run:ai Model Streamer in the best practices doc.
Until we've tested and verified runai-model-streamer, I don't think we can make a hard recommendation on just using it. It currently primarily supports vLLM, so some customers may look for a solution that can be used across inference servers.
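For comparison, a minimal sketch of what the Run:ai Model Streamer alternative might look like with vLLM, assuming the serving image includes the streamer extra (`pip install "vllm[runai]"`); the concurrency value is purely illustrative:

```shell
# Hypothetical alternative: stream weights at load time instead of pre-warming
# the page cache with an init container. Reads from the same /gcs mount.
vllm serve /gcs/${MODEL_ID} \
  --load-format runai_streamer \
  --model-loader-extra-config '{"concurrency": 16}'
```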
```yaml
driver: gcsfuse.csi.storage.gke.io
volumeAttributes:
  bucketName: cloud-storage-bucket-name
  mountOptions: metadata-cache:ttl-secs:-1,metadata-cache:stat-cache-max-size-mb:-1,metadata-cache:type-cache-max-size-mb:-1,metadata-cache:negative-ttl-secs:0,file-cache:max-size-mb:-1,file-cache:cache-file-for-range-read:true,file-cache:enable-parallel-downloads:true,implicit-dirs,file-system:kernel-list-cache-ttl-secs:-1
```
Need to add only-dir?
In the current implementation, two models are needed, so either two separate mounts would need to be created or we'd need to leave off only-dir.
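If we do split it, a rough sketch of what per-model volumes could look like, assuming the Cloud Storage FUSE CSI driver's `only-dir` mount option; the volume names and object prefixes are placeholders, not the actual repo values:

```yaml
# Hypothetical: one CSI ephemeral volume per model, each scoped to its own
# prefix in the bucket via only-dir.
- name: target-model-bucket
  csi:
    driver: gcsfuse.csi.storage.gke.io
    volumeAttributes:
      bucketName: cloud-storage-bucket-name
      mountOptions: implicit-dirs,only-dir=path/to/target-model
- name: draft-model-bucket
  csi:
    driver: gcsfuse.csi.storage.gke.io
    volumeAttributes:
      bucketName: cloud-storage-bucket-name
      mountOptions: implicit-dirs,only-dir=path/to/draft-model
```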
@alizaidis I didn't complete a full test round yet. @syeda-anjum do you have time for that? Thanks
Can you also rebase from main and fix the PR checks.
Speculative decoding is a powerful optimization technique that enhances LLM inference speed without compromising output quality. It utilizes a smaller, faster "draft" model or method to generate candidate tokens, which are then validated by the main, larger "target" model in a single, efficient step. This reduces the computational overhead and improves both throughput and inter-token latency.

vLLM supports several speculative decoding methods, each tailored to different use cases and performance requirements. See the [Speculative Decoding guide](https://docs.vllm.ai/en/v0.11.0/features/spec_decode.html) in the official vLLM docs for in-depth concepts and examples. This guide walks you through the implementation of the following speculative decoding methods with vLLM on GKE:

- [N-gram Based Speculative Decoding](https://docs.vllm.ai/en/v0.11.0/features/spec_decode.html#speculating-by-matching-n-grams-in-the-prompt)
Do you want to pin to v0.11.0 of the docs or use stable?
  This method is particularly effective for tasks where the output is likely to contain sequences from the input prompt, such as summarization or question-answering. Instead of a draft model, it uses n-grams from the prompt to generate token proposals.

- [EAGLE Based Draft Models](https://docs.vllm.ai/en/v0.11.0/features/spec_decode.html#speculating-using-eagle-based-draft-models)
Do you want to pin to v0.11.0 of the docs or use latest?
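For context on the two methods listed above, a rough sketch of how the vLLM server flags differ, assuming the JSON `--speculative-config` syntax in vLLM v0.11.0; the token counts are illustrative and the EAGLE draft-model path is a placeholder:

```shell
# N-gram: no draft model; proposals come from matching n-grams in the prompt.
vllm serve ${MODEL_ID} \
  --speculative-config '{"method": "ngram", "num_speculative_tokens": 5, "prompt_lookup_max": 4}'

# EAGLE: a separate EAGLE draft model proposes tokens that the target model verifies.
vllm serve ${MODEL_ID} \
  --speculative-config '{"method": "eagle", "model": "<eagle-draft-model>", "num_speculative_tokens": 3}'
```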
```
CONTAINER_IMAGE_URL=docker.io/vllm/vllm-openai:v0.11.0
```
This file shouldn't be committed.
[Allocation quotas: GPU quota](https://cloud.google.com/compute/resource-usage#gpu_quota).

The Kubernetes manifests invoked below are based on the [Inference Quickstart recommendations](https://cloud.google.com/kubernetes-engine/docs/how-to/machine-learning/inference-quickstart).
Is this true?
| gemma-3-27b-it | ✅ | ✅ |
| llama-3.3-70b-instruct | ✅ | ✅ |
There are no h200 manifests.
Start a port forward to the model service.

```shell
kubectl --namespace=${ira_online_gpu_kubernetes_namespace_name} port-forward service/vllm-h100-llama-3-70b-it-sd-eagle 8000:8000 >/dev/null & \
```
This probably needs to be parameterized for model and accelerator.
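Once the port forward is running, a quick smoke test against vLLM's OpenAI-compatible API could look like the following; the model name must match the served model and is a placeholder here:

```shell
# Send a single chat completion request through the forwarded port.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "<served-model-name>",
    "messages": [{"role": "user", "content": "Summarize speculative decoding in one sentence."}],
    "max_tokens": 128
  }'
```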
- Delete the workload.

  ```shell
  kubectl delete --ignore-not-found --kustomize "${ACP_REPO_DIR}/platforms/gke/base/use-cases/inference-ref-arch/kubernetes-manifests/online-inference-gpu/vllm-spec-decoding/h100-llama-3-70b-it-sd-eagle"
  ```
This probably needs to be parameterized for model and accelerator.
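One possible shape for that parameterization; the variable names below are hypothetical and not part of the current manifests:

```shell
# Hypothetical: select the kustomize overlay by accelerator, model, and method.
ACCELERATOR="h100"
MODEL="llama-3-70b-it"
METHOD="eagle"
kubectl delete --ignore-not-found --kustomize \
  "${ACP_REPO_DIR}/platforms/gke/base/use-cases/inference-ref-arch/kubernetes-manifests/online-inference-gpu/vllm-spec-decoding/${ACCELERATOR}-${MODEL}-sd-${METHOD}"
```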
```yaml
nameSuffix: -h100-gemma-3-27b-it-sd-ngram

```
Extra newline
```yaml
name: vllm
path: patch-spec-decoding.yaml

```
Extra newline
```yaml
template:
  spec:
    nodeSelector:
      cloud.google.com/compute-class: gpu-h100-80gb-high-x1
```
Is one h100 sufficient?
Added initial examples for vLLM speculative decoding with the n-gram and EAGLE methods. We can update with MLP Speculators and benchmarks against base vLLM for each configuration once the measurement flows are finalized.