rshavitt/offloading_benchmarks

Benchmarking vLLM Production Stack Performance with Different ca

This benchmark measures offloading times between CPU and GPU for different implementations.
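As a rough illustration of how a transfer can be timed, the sketch below uses only the Python standard library. `time_transfer` and `bandwidth_gb_s` are hypothetical helpers, not functions from offloading_benchmarks.py, and a host-side buffer copy stands in for the real CPU/GPU transfer. For an actual GPU copy the device would also need to be synchronized before each clock read, because CUDA work is launched asynchronously.

```python
import statistics
import time


def time_transfer(transfer, warmup=2, iters=10):
    """Return the median wall-clock time in seconds of calling transfer()."""
    for _ in range(warmup):          # warm-up runs are not timed
        transfer()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        transfer()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def bandwidth_gb_s(num_bytes, seconds):
    """Convert a byte count and a duration into GB/s (1 GB = 1e9 bytes)."""
    return num_bytes / seconds / 1e9


# Stand-in for an offload: copy a 16 MiB host buffer into another one.
src = bytearray(16 * 1024 * 1024)
dst = bytearray(len(src))


def copy_once():
    dst[:] = src


elapsed = time_transfer(copy_once)
print(f"median {elapsed * 1e3:.3f} ms, {bandwidth_gb_s(len(src), elapsed):.2f} GB/s")
```

The median is used rather than the mean so that a single slow iteration does not dominate the reported time.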

1. vLLM's swap_blocks

python3 offloading_benchmarks.py --case=vllm

2. LMCache multi_layer_kv_transfer

In this directory, run the following command to build the CUDA code:

python3 setup.py install

Then run the script:

python3 offloading_benchmarks.py --case=lmcache

3. Other variations of LMCache multi_layer_kv_transfer

You can implement other variations of the multi_layer_kv_transfer function, or test the variations I have written. Note that multi_layer_kv_transfer calls the kernel function load_and_reshape_multi_layer_kernel. All the implementations are in mem_kernels.cu, and you can replace either of them.

The ones I've implemented:

  • multi_layer_kv_transfer_Memcpy: uses memcpy for each token in each layer.
  • load_and_reshape_multi_layer_kernel_memcpy: the kernel function for multi_layer_kv_transfer_Memcpy.
  • load_and_reshape_multi_layer_kernel_int4: copies 16 bytes in each thread iteration.
  • load_and_reshape_multi_layer_kernel_test: improves the arithmetic calculations.
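To make the 16-byte idea concrete, the sketch below works through the arithmetic in Python. It assumes that one token's key or value vector is contiguous in memory and that the threads of a block walk over the chunks with a stride equal to the thread count. The function names and the example sizes (hidden size 1024, 2-byte elements, 128 threads) are illustrative assumptions, not values taken from mem_kernels.cu.

```python
INT4_BYTES = 16  # size of CUDA's int4 vector type (4 x 32-bit ints)


def chunks_per_token(hidden_size, elem_bytes, chunk_bytes=INT4_BYTES):
    """Number of chunk_bytes-sized copies needed for one token's vector."""
    total = hidden_size * elem_bytes
    if total % chunk_bytes != 0:
        raise ValueError("vector size must be a multiple of the chunk size")
    return total // chunk_bytes


def chunks_for_thread(thread_id, num_threads, num_chunks):
    """Chunk indices one thread copies in a thread-strided loop."""
    return list(range(thread_id, num_chunks, num_threads))


def iterations_per_thread(num_chunks, num_threads):
    """Loop iterations of the busiest thread (ceiling division)."""
    return -(-num_chunks // num_threads)


# Example: hidden size 1024 with 2-byte elements, 128 threads.
n_vec = chunks_per_token(1024, 2)                    # 16 bytes per copy
n_scalar = chunks_per_token(1024, 2, chunk_bytes=2)  # one element per copy
print(n_vec, n_scalar)
print(iterations_per_thread(n_vec, 128), iterations_per_thread(n_scalar, 128))
print(chunks_for_thread(0, 128, n_scalar)[:3])
```

Under these assumptions each thread issues one 16-byte copy per token instead of eight 2-byte copies, which is the kind of saving the int4 variant aims for.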

To use a different implementation, replace the call to the original function with the one you want, and then recompile the CUDA code:

rm -rf build *.egg-info *.so
python3 setup.py install

Important: If you write a new implementation of multi_layer_kv_transfer, you also need to do the following:

  1. Add a declaration in offloading_mem_kernels.cuh, for example:
void single_layer_kv_transfer(torch::Tensor& lmc_key_value_cache,
                              torch::Tensor& vllm_key_cache,
                              torch::Tensor& vllm_value_cache,
                              torch::Tensor& slot_mapping,
                              const bool direction);
  2. Register the function in lmc_ops.cpp, for example:
m.def("multi_layer_kv_transfer", &multi_layer_kv_transfer);
  3. Switch the function call in offloading_benchmarks.py, for example:
lmc_ops.multi_layer_kv_transfer_Memcpy(
        memlist[i],
        pointers_dst,
        slot_mapping[start:end],
        gpu_kv_dst[0].device,
        page_buffer_size,
        False,
        use_mla,
    )
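The indices i, start and end in the call above suggest that it runs once per chunk of tokens, with slot_mapping[start:end] selecting the slots of chunk i. How those bounds are computed is not shown here; a minimal sketch, assuming fixed-size chunks with a shorter final chunk, could look like this. `chunk_ranges` and the sizes are hypothetical, and a plain list stands in for the slot tensor.

```python
def chunk_ranges(num_tokens, chunk_size):
    """Yield (i, start, end) so that slot_mapping[start:end] is chunk i."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for i, start in enumerate(range(0, num_tokens, chunk_size)):
        yield i, start, min(start + chunk_size, num_tokens)


# Hypothetical sizes: 10 tokens split into chunks of 4.
slot_mapping = list(range(100, 110))   # stand-in for the real slot tensor
for i, start, end in chunk_ranges(len(slot_mapping), 4):
    print(i, start, end, slot_mapping[start:end])
```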

4. Plot results

Use the functions in offloading_benchmarks_plotter.py to plot the results. Each function produces a different plot.
