This benchmark measures offloading times between CPU and GPU with different implementations.
To run the vLLM case:
python3 offloading_benchmarks.py --case=vllm
In the directory, run the following command to build the CUDA code:
python3 setup.py install
Then run the script:
python3 offloading_benchmarks.py --case=lmcache
You can implement other variations of the multi_layer_kv_transfer function and test the variations that I've written. Note that multi_layer_kv_transfer calls the kernel function load_and_reshape_multi_layer_kernel. All the implementations are in mem_kernels.cu, and you can replace either of them.
The ones I've implemented:
- multi_layer_kv_transfer_Memcpy: uses memcpy for each token in each layer.
- load_and_reshape_multi_layer_kernel_memcpy: the kernel function for multi_layer_kv_transfer_Memcpy.
- load_and_reshape_multi_layer_kernel_int4: copies 16 bytes in each thread iteration.
- load_and_reshape_multi_layer_kernel_test: improves the arithmetic calculation.
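The `_int4` variant moves data in 16-byte units (CUDA's `int4` holds four 32-bit integers). As a rough illustration of how that changes the number of copy operations per token, here is a minimal sketch. The function name, the hidden size, and the 2-byte element size are illustrative assumptions, not values taken from mem_kernels.cu.

```python
INT4_BYTES = 16  # sizeof(int4) in CUDA: four 32-bit integers


def copies_per_token(hidden_size: int, elem_bytes: int, unit_bytes: int) -> int:
    """Number of copy operations needed to move one token's row.

    hidden_size, elem_bytes: hypothetical row shape (element count, bytes per element).
    unit_bytes: bytes moved by a single copy operation.
    """
    row_bytes = hidden_size * elem_bytes
    if row_bytes % unit_bytes != 0:
        # Wide copies only cover whole units; a real kernel would need a scalar tail.
        raise ValueError("row size is not a multiple of the copy unit")
    return row_bytes // unit_bytes


if __name__ == "__main__":
    hidden, elem = 1024, 2  # assumed: 1024 two-byte values per token row
    scalar = copies_per_token(hidden, elem, elem)
    vector = copies_per_token(hidden, elem, INT4_BYTES)
    print(f"element-wise copies: {scalar}, 16-byte copies: {vector}")
```

Wider copies of this kind generally require the source and destination addresses to be 16-byte aligned, so whether the variant applies depends on the buffer layout.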
Replace the call to the original function with the implementation you want, then recompile the CUDA code:
rm -rf build *.egg-info *.so
python3 setup.py install
Important: If you write a new implementation of multi_layer_kv_transfer, you also need to do the following:
- Add a declaration in offloading_mem_kernels.cuh
void single_layer_kv_transfer(torch::Tensor& lmc_key_value_cache,
torch::Tensor& vllm_key_cache,
torch::Tensor& vllm_value_cache,
torch::Tensor& slot_mapping,
const bool direction);
- Add a binding in lmc_ops.cpp
m.def("multi_layer_kv_transfer", &multi_layer_kv_transfer);
- Switch the call to the function in offloading_benchmarks.py
lmc_ops.multi_layer_kv_transfer_Memcpy(
memlist[i],
pointers_dst,
slot_mapping[start:end],
gpu_kv_dst[0].device,
page_buffer_size,
False,
use_mla,
)
Use the functions in offloading_benchmarks_plotter.py to plot the results. Each function produces a different plot.
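For the call-site switch in offloading_benchmarks.py, one option is to look the transfer function up by name instead of editing the call for every variant. This is only a sketch: `resolve_transfer_fn` is a hypothetical helper that does not exist in the repository, and the stand-in object below replaces the compiled lmc_ops module so the example runs without a GPU.

```python
from types import SimpleNamespace
from typing import Any, Callable


def resolve_transfer_fn(ops: Any, variant: str = "") -> Callable[..., Any]:
    """Return ops.multi_layer_kv_transfer, or its suffixed variant.

    ops: module-like object exposing the bound functions (lmc_ops in the benchmark).
    variant: suffix such as "Memcpy"; an empty string selects the original function.
    """
    name = "multi_layer_kv_transfer" + (f"_{variant}" if variant else "")
    fn = getattr(ops, name, None)
    if fn is None:
        raise AttributeError(f"{name} is not bound; was it added with m.def and rebuilt?")
    return fn


if __name__ == "__main__":
    # Stand-in for the compiled extension (assumption, for illustration only).
    fake_ops = SimpleNamespace(
        multi_layer_kv_transfer=lambda *args: "original",
        multi_layer_kv_transfer_Memcpy=lambda *args: "memcpy",
    )
    transfer = resolve_transfer_fn(fake_ops, "Memcpy")
    print(transfer())
```

In the benchmark, the returned function would be called with the same arguments as the existing lmc_ops call.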