This benchmark compares retrival times from different components with different caching configurations. In each iteration there are 2 requests sent with the same prompt, and the timings of both requests.
- GPU miss, no CPU: without enable_prefix_caching, therefore both requests with the same prompt are not cached in GPU.
- GPU hit: with enable_prefix_caching, in the first request the prompt is cached in GPU, and the second is GPU hit.
- GPU miss, CPU hit: first request is cached in both GPU and CPU, then GPU cashed is filled, so the second request is a miss in GPU and hit in CPU.
Apply the pod using:
oc apply -f pods_configuration/python-runner.yamlhelm install vllm vllm/vllm-stack -f pods_configuration/<congifuration_name>.yamloc port-forward svc/vllm-router-service 30080:80python3 single_requests.py --case=gpu_miss --rounds=10 --tokens=128Choose the number of tokens in the prompt from [128, 1000, 4096]. Choose the case to be from [gpu_miss, gpu_hit, cpu_hit]
The script plots all .json results files from a given folder, first and second requests. It drops the first iteration of the experiment.