Benchmarking vLLM Production Stack Performance with Different ca

This benchmark compares retrieval times from different components under different caching configurations. Each iteration sends two requests with the same prompt and records the timing of both.

GPU miss, no CPU: runs without enable_prefix_caching, so neither of the two identical requests is served from the GPU cache.
GPU hit: runs with enable_prefix_caching. The first request caches the prompt in GPU, and the second request is a GPU hit.
GPU miss, CPU hit: the first request is cached in both GPU and CPU. The GPU cache is then filled, so the second request is a miss in GPU and a hit in CPU.
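The measurement pattern shared by all three cases (same prompt, two requests, two timings) can be sketched roughly as follows. This is an illustration only, not the code in single_requests.py: `time_request_pair` and the `send` callable are hypothetical names, and `send` stands in for whatever function actually issues a request.

```python
import time
from typing import Callable, Tuple


def time_request_pair(send: Callable[[str], object], prompt: str) -> Tuple[float, float]:
    """Send the same prompt twice and return both latencies in seconds.

    `send` is a hypothetical callable that issues one request and blocks
    until the response has arrived.
    """
    timings = []
    for _ in range(2):
        start = time.perf_counter()
        send(prompt)
        timings.append(time.perf_counter() - start)
    return timings[0], timings[1]


if __name__ == "__main__":
    # Stand-in sender so the sketch runs without a deployed stack.
    first, second = time_request_pair(lambda p: time.sleep(0.01), "example prompt")
    print(f"first: {first:.4f}s, second: {second:.4f}s")
```

In the GPU hit and CPU hit cases, the second timing is the one expected to reflect the cache.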

1. 🛠️ Deploy the `python-runner.yaml`

Apply the pod using:

oc apply -f pods_configuration/python-runner.yaml

2. 🛠️ Deploy the vLLM stack with the configuration that matches the case you want to measure

helm install vllm vllm/vllm-stack -f pods_configuration/<configuration_name>.yaml

3. Forward the port:

oc port-forward svc/vllm-router-service 30080:80
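With the port forwarded, the router should be reachable on local port 30080. A minimal sketch of building a request against it is shown below. It assumes the router exposes an OpenAI-compatible `/v1/completions` endpoint; `build_completion_request` is a hypothetical helper, and the model name must be replaced with whatever model your configuration serves.

```python
import json
import urllib.request

# Local end of the port-forward (30080 -> service port 80).
BASE_URL = "http://localhost:30080"


def build_completion_request(prompt: str, model: str, max_tokens: int = 1) -> urllib.request.Request:
    """Build (but do not send) a POST request for a text completion.

    Assumes an OpenAI-compatible completions endpoint behind the router.
    """
    body = json.dumps(
        {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    ).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/v1/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

To send it, pass the result to `urllib.request.urlopen(...)` while the port-forward is running.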

4. Run single_requests.py with the desired parameters

python3 single_requests.py --case=gpu_miss --rounds=10 --tokens=128

Choose the number of tokens in the prompt from [128, 1000, 4096], and choose the case from [gpu_miss, gpu_hit, cpu_hit].
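The allowed values above map naturally onto `argparse` choices. The sketch below shows one way such a command line could be validated. It is not taken from single_requests.py, and the defaults and the `build_parser` name are assumptions.

```python
import argparse

CASES = ["gpu_miss", "gpu_hit", "cpu_hit"]
TOKEN_COUNTS = [128, 1000, 4096]


def build_parser() -> argparse.ArgumentParser:
    """Parser accepting the flags shown in the command above."""
    parser = argparse.ArgumentParser(description="Send pairs of identical requests and time them.")
    parser.add_argument("--case", choices=CASES, required=True, help="caching scenario to measure")
    # Default values here are assumptions made for this sketch.
    parser.add_argument("--rounds", type=int, default=10, help="number of iterations")
    parser.add_argument("--tokens", type=int, choices=TOKEN_COUNTS, default=128, help="prompt length in tokens")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.case, args.rounds, args.tokens)
```

With `choices`, an unsupported value such as `--tokens=500` is rejected before any request is sent.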

5. Plot the results with single_requests_plotter.py

The script plots every .json results file in a given folder, showing the timings of both the first and the second request. The first iteration of each experiment is dropped.
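Dropping the first iteration can be sketched as below. The layout of the results files is not documented here, so this example assumes each .json file holds an object with two lists of timings, `"first"` and `"second"`, one entry per iteration. Those keys and all function names are hypothetical, and the sketch reports means instead of drawing a plot.

```python
import json
from pathlib import Path
from statistics import mean


def drop_warmup(timings):
    """Discard the first iteration."""
    return timings[1:]


def summarize(result):
    """Mean latency of first and second requests, excluding the first iteration.

    Assumes `result` looks like {"first": [...], "second": [...]}
    and contains at least two iterations.
    """
    return {
        "first_mean": mean(drop_warmup(result["first"])),
        "second_mean": mean(drop_warmup(result["second"])),
    }


def summarize_folder(folder):
    """Summarize every .json file found directly inside `folder`."""
    return {
        path.name: summarize(json.loads(path.read_text()))
        for path in sorted(Path(folder).glob("*.json"))
    }
```

For example, `summarize({"first": [9.0, 1.0, 3.0], "second": [9.0, 0.5, 1.5]})` ignores the 9.0 entries and averages the remaining values.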
