rshavitt/cache_retrival_time

Benchmarking vLLM Production Stack Performance with Different ca

This benchmark compares retrieval times from different components under different caching configurations. Each iteration sends two requests with the same prompt and records the timing of both.

  • GPU miss, no CPU: enable_prefix_caching is off, so neither of the two same-prompt requests is served from the GPU cache.
  • GPU hit: enable_prefix_caching is on; the first request caches the prompt in GPU, and the second request is a GPU hit.
  • GPU miss, CPU hit: the first request is cached in both GPU and CPU; the GPU cache is then filled, so the second request misses in GPU and hits in CPU.
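
The two-request measurement described above can be sketched as follows. This is a minimal illustration, not the code in single_requests.py: `send` is a hypothetical callable standing in for whatever sends the prompt to the router, and `make_prompt` is a hypothetical helper that builds the prompt for each round.

```python
import time
from typing import Callable, Dict, List


def time_request_pair(send: Callable[[str], object], prompt: str) -> Dict[str, float]:
    """Send the same prompt twice and return the wall-clock time of each request."""
    timings: Dict[str, float] = {}
    for label in ("first", "second"):
        start = time.perf_counter()
        send(prompt)  # hypothetical: blocks until the response has arrived
        timings[label] = time.perf_counter() - start
    return timings


def run_rounds(
    send: Callable[[str], object],
    make_prompt: Callable[[int], str],
    rounds: int,
) -> List[Dict[str, float]]:
    """Run `rounds` iterations, each one timing a pair of identical requests."""
    return [time_request_pair(send, make_prompt(i)) for i in range(rounds)]
```

Passing the sender in as a callable keeps the timing logic independent of the HTTP client, so the same loop can be exercised with a stub before pointing it at a real deployment.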

1. 🛠️ Deploy the python-runner.yaml

Apply the pod using:

oc apply -f pods_configuration/python-runner.yaml

2. 🛠️ Deploy the pod configuration that matches the desired case

helm install vllm vllm/vllm-stack -f pods_configuration/<configuration_name>.yaml

3. Forward the port:

oc port-forward svc/vllm-router-service 30080:80

4. Run single_requests.py with the desired parameters

python3 single_requests.py --case=gpu_miss --rounds=10 --tokens=128

Choose the number of tokens in the prompt from [128, 1000, 4096], and the case from [gpu_miss, gpu_hit, cpu_hit].
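
The allowed values above could be enforced at the command line with `argparse`. The sketch below is an assumption about how such a parser might look, not the parser in single_requests.py; the defaults for `--rounds` and `--tokens` are illustrative only.

```python
import argparse

# Allowed values as listed in this README.
TOKEN_CHOICES = [128, 1000, 4096]
CASE_CHOICES = ["gpu_miss", "gpu_hit", "cpu_hit"]


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that rejects unsupported cases and token counts."""
    parser = argparse.ArgumentParser(description="Time pairs of identical requests.")
    parser.add_argument("--case", choices=CASE_CHOICES, required=True,
                        help="caching scenario to measure")
    parser.add_argument("--rounds", type=int, default=10,  # assumed default
                        help="number of request pairs to send")
    parser.add_argument("--tokens", type=int, choices=TOKEN_CHOICES, default=128,  # assumed default
                        help="number of tokens in the prompt")
    return parser
```

With this parser, `build_parser().parse_args(["--case=gpu_miss", "--rounds=10", "--tokens=128"])` accepts the command shown in step 4, while an unsupported value such as `--tokens=256` exits with a usage error.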

5. Plot the results with single_requests_plotter.py

The script plots the first- and second-request timings from every .json results file in a given folder. It drops the first iteration of the experiment.
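
The loading step behind such a plot can be sketched as below. The JSON layout is an assumption made for illustration (one object per file with `"first"` and `"second"` lists of timings in seconds); the actual result format written by single_requests.py may differ. The sketch stops at the data that would be plotted and leaves the drawing itself out.

```python
import json
from pathlib import Path
from statistics import mean
from typing import Dict, List


def load_results(folder: str) -> Dict[str, Dict[str, List[float]]]:
    """Read every .json file in `folder`, dropping the first iteration of each series."""
    series: Dict[str, Dict[str, List[float]]] = {}
    for path in sorted(Path(folder).glob("*.json")):
        with path.open() as fh:
            data = json.load(fh)
        # Assumed schema: {"first": [...], "second": [...]}.
        series[path.stem] = {
            "first": data["first"][1:],    # skip the first (warm-up) iteration
            "second": data["second"][1:],
        }
    return series


def summarize(series: Dict[str, Dict[str, List[float]]]) -> Dict[str, Dict[str, float]]:
    """Mean timing of the first and second request for each results file."""
    return {
        name: {label: mean(values) for label, values in timings.items()}
        for name, timings in series.items()
    }
```

Each entry returned by `load_results` holds two series per results file, one for the first request and one for the second, which can then be handed to any plotting library.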