@feuler Thanks for the great work - I am trying to run Llama 70B with the unified memory patch. I downloaded the prebuilt wheels from the release page and ran them in a Docker container. I tested with both `export PYTORCH_CUDA_ALLOC_CONF='use_uvm:True,uvm_oversubscription_ratio:5.0,uvm_access_pattern:balanced'` and `export PYTORCH_CUDA_ALLOC_CONF='use_uvm:True,uvm_oversubscription_ratio:5.0,uvm_access_pattern:gpu_first'`.
vLLM loads the model weights but fails while initializing the KV cache blocks. Below is the error message.
INFO 02-10 15:24:40 model_runner.py:1099] Loading model weights took 131.4185 GB
INFO 02-10 15:25:25 worker.py:241] Memory profiling takes 45.33 seconds
INFO 02-10 15:25:25 worker.py:241] the current vLLM instance can use total_gpu_memory (94.50GiB) x gpu_memory_utilization (0.90) = 85.05GiB
INFO 02-10 15:25:25 worker.py:241] model weights take 131.42GiB; non_torch_memory takes -39.47GiB; PyTorch activation peak memory takes 1.72GiB; the rest of the memory reserved for KV Cache is -8.62GiB.
INFO 02-10 15:25:26 gpu_executor.py:76] # GPU blocks: 0, # CPU blocks: 819
INFO 02-10 15:25:26 gpu_executor.py:80] Maximum concurrency for 8192 tokens per request: 0.00x
ERROR 02-10 15:25:26 engine.py:366] No available memory for the cache blocks. Try increasing `gpu_memory_utilization` when initializing the engine.
ERROR 02-10 15:25:26 engine.py:366] Traceback (most recent call last):
ERROR 02-10 15:25:26 engine.py:366] File "/usr/local/lib/python3.11/dist-packages/vllm/engine/multiprocessing/engine.py", line 357, in run_mp_engine
ERROR 02-10 15:25:26 engine.py:366] engine = MQLLMEngine.from_engine_args(engine_args=engine_args,
ERROR 02-10 15:25:26 engine.py:366] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 02-10 15:25:26 engine.py:366] File "/usr/local/lib/python3.11/dist-packages/vllm/engine/multiprocessing/engine.py", line 119, in from_engine_args
ERROR 02-10 15:25:26 engine.py:366] return cls(ipc_path=ipc_path,
ERROR 02-10 15:25:26 engine.py:366] ^^^^^^^^^^^^^^^^^^^^^^
ERROR 02-10 15:25:26 engine.py:366] File "/usr/local/lib/python3.11/dist-packages/vllm/engine/multiprocessing/engine.py", line 71, in __init__
ERROR 02-10 15:25:26 engine.py:366] self.engine = LLMEngine(*args, **kwargs)
ERROR 02-10 15:25:26 engine.py:366] ^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 02-10 15:25:26 engine.py:366] File "/usr/local/lib/python3.11/dist-packages/vllm/engine/llm_engine.py", line 276, in __init__
ERROR 02-10 15:25:26 engine.py:366] self._initialize_kv_caches()
ERROR 02-10 15:25:26 engine.py:366] File "/usr/local/lib/python3.11/dist-packages/vllm/engine/llm_engine.py", line 429, in _initialize_kv_caches
ERROR 02-10 15:25:26 engine.py:366] self.model_executor.initialize_cache(num_gpu_blocks, num_cpu_blocks)
ERROR 02-10 15:25:26 engine.py:366] File "/usr/local/lib/python3.11/dist-packages/vllm/executor/gpu_executor.py", line 83, in initialize_cache
ERROR 02-10 15:25:26 engine.py:366] self.driver_worker.initialize_cache(num_gpu_blocks, num_cpu_blocks)
ERROR 02-10 15:25:26 engine.py:366] File "/usr/local/lib/python3.11/dist-packages/vllm/worker/worker.py", line 266, in initialize_cache
ERROR 02-10 15:25:26 engine.py:366] raise_if_cache_size_invalid(num_gpu_blocks,
ERROR 02-10 15:25:26 engine.py:366] File "/usr/local/lib/python3.11/dist-packages/vllm/worker/worker.py", line 493, in raise_if_cache_size_invalid
ERROR 02-10 15:25:26 engine.py:366] raise ValueError("No available memory for the cache blocks. "
ERROR 02-10 15:25:26 engine.py:366] ValueError: No available memory for the cache blocks. Try increasing `gpu_memory_utilization` when initializing the engine.
I am running vLLM as `python3.11 -m vllm.entrypoints.openai.api_server --model /models/LLM/Meta-Llama-3-70B-Instruct/`
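For what it's worth, plugging the log's numbers into the formula that `worker.py` prints shows why this fails: the budget is capped at physical GPU memory, so the UVM-oversubscribed weights leave a negative KV-cache allocation. This is just my reading of the printed numbers, not vLLM's actual code:

```python
# Sketch of the profiler's budget arithmetic, using the values from the log above.
total_gpu_memory = 94.50        # GiB, physical device memory
gpu_memory_utilization = 0.90   # vLLM default
model_weights = 131.42          # GiB -- already exceeds physical memory via UVM
non_torch_memory = -39.47       # GiB, negative because the weights spilled past HBM
activation_peak = 1.72          # GiB, PyTorch activation peak

budget = total_gpu_memory * gpu_memory_utilization
kv_cache = budget - model_weights - non_torch_memory - activation_peak
print(f"budget   = {budget:.2f} GiB")    # 85.05 GiB
print(f"kv_cache = {kv_cache:.2f} GiB")  # -8.62 GiB -> 0 GPU blocks
```

So if I read this correctly, the oversubscription ratio helps the allocator load the weights, but the cache-block sizing is still based on physical GPU memory and therefore comes out negative.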