- Ensure that Fluxon is cloned and installed on the machine.
- Run `pip install -e .` to install LMCache-Fluxon.
- Start the Fluxon/Mooncake storage service (if needed).
  - Mooncake: `./start_mooncake.sh` (master) and `python3 mooncake_server.py` (server).
- Ensure that vLLM is installed.
- Create a launcher config (copy from `start_vllm.example.yaml`) and edit at least `model` (and `tp`/`pp` if needed).
  - If you want Ray multi-node: start from `configs/ray/head_redis.yaml` + `configs/ray/worker_redis.yaml` (or `*_valkey.yaml`, `*_mooncake.yaml`).
  - `lmcache_config_file` is required and should point to an LMCache YAML (e.g. `configs/lmcache/redis.yaml`) that sets `remote_url`.
- Start vLLM with LMCache: `./start_vllm.py -c <YOUR_CONFIG.yaml>`
- Wait for vLLM to finish starting.
- Run `ttft-estimator.py` to test the setup.
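
The "wait for vLLM" step can be automated by polling the server's TCP port until it accepts connections. The following is a minimal sketch, not part of this repository: the helper name `wait_for_port` is hypothetical, and port 8000 in the usage note is assumed only because it is vLLM's default for its OpenAI-compatible server. Substitute whatever host and port your launcher config actually uses.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 600.0, interval: float = 2.0) -> bool:
    """Return True once host:port accepts a TCP connection, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused or timed out: the server is not up yet.
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

For example, `wait_for_port("127.0.0.1", 8000)` could be called before launching `ttft-estimator.py`. An open port only shows that a process is listening; it does not prove the model has finished loading, so treat this as a coarse readiness check.
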
Notes:
- Multi-GPU (single node): set `CUDA_VISIBLE_DEVICES=0,1,...` and set `tp: <GPU_COUNT>` in the launcher config.
- Multi-node (Ray): set `ray.mode: head` in the head config; in worker configs set `ray.mode: worker` and either `ray.address: <head_ip:port>` or `ray.head_host: <head_ip>` + `ray.port: <port>`.
- If the machine is shared by multiple users, consider setting `ray.temp_dir` in the launcher config to avoid `/tmp/ray` permission conflicts (default is `/tmp/ray-$USER-lmcache-$PORT`).
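
The worker-address rule and the temp-dir default described in these notes can be written out as a short sketch. It assumes the `ray` section of the launcher config has been loaded into a dict, which is a guess about the YAML layout. The function names are hypothetical, and letting `ray.address` take precedence when both forms are present is an assumption, not documented launcher behavior.

```python
import getpass
from typing import Any, Mapping, Optional


def resolve_ray_address(ray_cfg: Mapping[str, Any]) -> str:
    """Return "<head_ip>:<port>" for a worker from its `ray` config section."""
    address = ray_cfg.get("address")
    if address:
        # Assumed precedence: an explicit ray.address wins.
        return str(address)
    head_host = ray_cfg.get("head_host")
    port = ray_cfg.get("port")
    if head_host and port:
        return f"{head_host}:{port}"
    raise ValueError("worker config needs ray.address, or ray.head_host plus ray.port")


def default_ray_temp_dir(port: int, user: Optional[str] = None) -> str:
    """Build a path of the documented default form /tmp/ray-$USER-lmcache-$PORT."""
    user = user or getpass.getuser()
    return f"/tmp/ray-{user}-lmcache-{port}"
```

Because the default path includes both the user name and the port, two users on the same machine end up with separate directories instead of competing for `/tmp/ray`.
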
Quick Ray test (2 nodes):
- Head node:
  - In config: `ray.mode: head`
  - Run: `./start_vllm.py -c <HEAD_CONFIG.yaml>`
- Worker node(s):
  - In config: `ray.mode: worker` and `ray.address: <HEAD_IP>:6379`
  - Run: `./start_vllm.py -c <WORKER_CONFIG.yaml>`
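
For the two-node test, the head and worker settings shown here differ only in their `ray` keys. A small sketch that renders both fragments from a head IP follows. It assumes the dotted keys map to a nested `ray:` block in the YAML, and `render_ray_section` is a hypothetical helper, not something the launcher provides. The output would still need to be merged into configs that set `model` and `lmcache_config_file`.

```python
def render_ray_section(mode: str, head_ip: str = "", port: int = 6379) -> str:
    """Render the `ray` block of a launcher config as YAML text (assumed layout)."""
    if mode not in ("head", "worker"):
        raise ValueError("mode must be 'head' or 'worker'")
    lines = ["ray:", f"  mode: {mode}"]
    if mode == "worker":
        if not head_ip:
            raise ValueError("worker configs need the head node's IP")
        # Quoted so the colon in host:port is never misread by a YAML parser.
        lines.append(f'  address: "{head_ip}:{port}"')
    return "\n".join(lines) + "\n"
```

Calling `render_ray_section("head")` and `render_ray_section("worker", "<HEAD_IP>")` with the real head IP gives the two fragments for the head and worker configs respectively.
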