Benchmarking vLLM Production Stack Performance with Different ca

This benchmark measures offloading times between CPU and GPU across different implementations.
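
As a rough illustration of what measuring an offloading time involves, here is a minimal timing sketch. It is not the code in offloading_benchmarks.py: the helper name time_transfer and its parameters are hypothetical, and the stand-in workload is a host-side buffer copy so the sketch runs without a GPU. For a real GPU transfer you would likely pass torch.cuda.synchronize as the sync argument, since device copies can return before they finish.

```python
import statistics
import time
from typing import Callable, Optional


def time_transfer(transfer: Callable[[], None],
                  sync: Optional[Callable[[], None]] = None,
                  warmup: int = 3,
                  iters: int = 10) -> float:
    """Return the median wall-clock time of `transfer` in milliseconds.

    `time_transfer` is a hypothetical helper, not part of this repository.
    """
    # Warm-up runs are executed but not timed.
    for _ in range(warmup):
        transfer()

    samples = []
    for _ in range(iters):
        if sync is not None:
            sync()  # drain pending device work before starting the clock
        start = time.perf_counter()
        transfer()
        if sync is not None:
            sync()  # wait for an asynchronous copy to finish
        samples.append((time.perf_counter() - start) * 1e3)
    return statistics.median(samples)


if __name__ == "__main__":
    # Stand-in workload: copy 1 MiB between two host buffers.
    src = bytearray(1 << 20)
    dst = bytearray(1 << 20)

    def copy() -> None:
        dst[:] = src

    print(f"median copy time: {time_transfer(copy):.3f} ms")
```

The median is used here instead of the mean so that a single slow iteration does not dominate the reported number.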

1. vLLM's swap_blocks

python3 offloading_benchmarks.py --case=vllm

2. LMCache multi_layer_kv_transfer

In this directory, run the following command to build the CUDA code:

python3 setup.py install

Then run the script:

python3 offloading_benchmarks.py --case=lmcache
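
If you want to run both cases back to back, a small driver along these lines could help. This is only a sketch: build_command and run_all are hypothetical names, and the only flags used are the two --case values shown above.

```python
import subprocess
import sys
from typing import List

# The two --case values documented in this README.
CASES = ("vllm", "lmcache")


def build_command(case: str) -> List[str]:
    """Build the command line for one benchmark case (hypothetical helper)."""
    if case not in CASES:
        raise ValueError(f"unknown case: {case!r}")
    return [sys.executable, "offloading_benchmarks.py", f"--case={case}"]


def run_all() -> None:
    """Run every case in order, stopping at the first failure."""
    for case in CASES:
        subprocess.run(build_command(case), check=True)
```

Calling run_all() from this directory runs the vllm case first and then the lmcache case, using the same Python interpreter that runs the driver.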

3. Other variations of LMCache multi_layer_kv_transfer

You can implement other variations of the multi_layer_kv_transfer function, or test the variations I've already written. Note that multi_layer_kv_transfer calls the kernel function load_and_reshape_multi_layer_kernel. All the implementations are in mem_kernels.cu, and you can replace either of them.

The ones I've implemented:

multi_layer_kv_transfer_Memcpy: uses memcpy on each token in each layer.
load_and_reshape_multi_layer_kernel_memcpy: the kernel function for multi_layer_kv_transfer_Memcpy.
load_and_reshape_multi_layer_kernel_int4: copies 16 bytes in each thread iteration.
load_and_reshape_multi_layer_kernel_test: improves the arithmetic calculations.
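
To see why copying 16 bytes per thread iteration can help, the arithmetic below counts how many copy operations one token needs at different chunk sizes. CUDA's int4 type packs four 32-bit integers, so one int4 load or store moves 16 bytes. The token shape (1024 values of 2 bytes each) and the one-element-per-iteration baseline are assumptions chosen for illustration, not values taken from mem_kernels.cu.

```python
def copies_per_token(num_elements: int, element_bytes: int, chunk_bytes: int) -> int:
    """Number of copy operations needed to move one token's data."""
    total_bytes = num_elements * element_bytes
    return -(-total_bytes // chunk_bytes)  # ceiling division


# Hypothetical shape: 1024 values per token, stored as 2-byte floats.
scalar = copies_per_token(1024, 2, 2)    # one element per iteration
vector = copies_per_token(1024, 2, 16)   # 16 bytes (one int4) per iteration
print(scalar, vector)  # 1024 128
```

Under these assumptions the vectorized variant issues one eighth as many copies per token. Whether that translates into a proportional speedup depends on alignment and memory bandwidth, which is what the benchmark is meant to measure.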

Replace the call to the original function with the implementation you want, and then recompile the CUDA code:

rm -rf build *.egg-info *.so
python3 setup.py install

Important: if you write a new implementation of multi_layer_kv_transfer, you also need to do the following:

Add a declaration in mem_kernels.cuh, following the pattern of the existing ones, for example:

void single_layer_kv_transfer(torch::Tensor& lmc_key_value_cache,
                              torch::Tensor& vllm_key_cache,
                              torch::Tensor& vllm_value_cache,
                              torch::Tensor& slot_mapping,
                              const bool direction);

Register the function in lmc_ops.cpp, following the pattern of the existing binding:

m.def("multi_layer_kv_transfer", &multi_layer_kv_transfer);

Switch the function call in offloading_benchmarks.py:

lmc_ops.multi_layer_kv_transfer_Memcpy(
    memlist[i],
    pointers_dst,
    slot_mapping[start:end],
    gpu_kv_dst[0].device,
    page_buffer_size,
    False,
    use_mla,
)
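
After recompiling, it can be worth confirming that the new function is actually exposed by the extension before running the benchmark. The sketch below is one way to do that. The helpers missing_ops and check_extension are hypothetical, and importing the extension under the module name lmc_ops is an assumption based on how it is referenced in offloading_benchmarks.py.

```python
import importlib
from types import ModuleType
from typing import Iterable, List


def missing_ops(module: ModuleType, names: Iterable[str]) -> List[str]:
    """Return the names that `module` does not expose as callables."""
    return [n for n in names if not callable(getattr(module, n, None))]


def check_extension(module_name: str, names: Iterable[str]) -> List[str]:
    """Import `module_name` and report which of `names` are missing."""
    module = importlib.import_module(module_name)
    return missing_ops(module, names)
```

For example, check_extension("lmc_ops", ["multi_layer_kv_transfer_Memcpy"]) returns an empty list when the binding was registered and the rebuilt extension is the one being imported. A non-empty result usually points to a missing m.def line or a stale build.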

4. Plot results

Use the functions in offloading_benchmarks_plotter.py to plot the results. Each function produces a different plot.

