@feuler Thanks for the great work - I am trying to run Llama 70B with the unified memory patch. I downloaded the prebuilt wheels from the release page and ran them in a Docker container. I tested with both `export PYTORCH_CUDA_ALLOC_CONF='use_uvm:True,uvm_oversubscription_ratio:5.0,uvm_access_pattern:balanced'` and `export PYTORCH_CUDA_ALLOC_CONF='use_uvm:True,uvm_oversubscription_ratio:5.0,uvm_access_pattern:gpu_first'`.
vLLM loads the model weights but fails while initializing the KV cache blocks. Below is the error message.
INFO 02-10 15:24:40 model_runner.py:1099] Loading model weights took 131.4185 GB
INFO 02-10 15:25:25 worker.py:241] Memory profiling takes 45.33 seconds
INFO 02-10 15:25:25 worker.py:241] the current vLLM instance can use total_gpu_memory (94.50GiB) x gpu_memory_utilization (0.90) = 85.05GiB
INFO 02-10 15:25:25 worker.py:241] model weights take 131.42GiB; non_torch_memory takes -39.47GiB; PyTorch activation peak memory takes 1.72GiB; the rest of the memory reserved for KV Cache is -8.62GiB.
INFO 02-10 15:25:26 gpu_executor.py:76] # GPU blocks: 0, # CPU blocks: 819
INFO 02-10 15:25:26 gpu_executor.py:80] Maximum concurrency for 8192 tokens per request: 0.00x
ERROR 02-10 15:25:26 engine.py:366] No available memory for the cache blocks. Try increasing `gpu_memory_utilization` when initializing the engine.
ERROR 02-10 15:25:26 engine.py:366] Traceback (most recent call last):
ERROR 02-10 15:25:26 engine.py:366] File "/usr/local/lib/python3.11/dist-packages/vllm/engine/multiprocessing/engine.py", line 357, in run_mp_engine
ERROR 02-10 15:25:26 engine.py:366] engine = MQLLMEngine.from_engine_args(engine_args=engine_args,
ERROR 02-10 15:25:26 engine.py:366] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 02-10 15:25:26 engine.py:366] File "/usr/local/lib/python3.11/dist-packages/vllm/engine/multiprocessing/engine.py", line 119, in from_engine_args
ERROR 02-10 15:25:26 engine.py:366] return cls(ipc_path=ipc_path,
ERROR 02-10 15:25:26 engine.py:366] ^^^^^^^^^^^^^^^^^^^^^^
ERROR 02-10 15:25:26 engine.py:366] File "/usr/local/lib/python3.11/dist-packages/vllm/engine/multiprocessing/engine.py", line 71, in __init__
ERROR 02-10 15:25:26 engine.py:366] self.engine = LLMEngine(*args, **kwargs)
ERROR 02-10 15:25:26 engine.py:366] ^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 02-10 15:25:26 engine.py:366] File "/usr/local/lib/python3.11/dist-packages/vllm/engine/llm_engine.py", line 276, in __init__
ERROR 02-10 15:25:26 engine.py:366] self._initialize_kv_caches()
ERROR 02-10 15:25:26 engine.py:366] File "/usr/local/lib/python3.11/dist-packages/vllm/engine/llm_engine.py", line 429, in _initialize_kv_caches
ERROR 02-10 15:25:26 engine.py:366] self.model_executor.initialize_cache(num_gpu_blocks, num_cpu_blocks)
ERROR 02-10 15:25:26 engine.py:366] File "/usr/local/lib/python3.11/dist-packages/vllm/executor/gpu_executor.py", line 83, in initialize_cache
ERROR 02-10 15:25:26 engine.py:366] self.driver_worker.initialize_cache(num_gpu_blocks, num_cpu_blocks)
ERROR 02-10 15:25:26 engine.py:366] File "/usr/local/lib/python3.11/dist-packages/vllm/worker/worker.py", line 266, in initialize_cache
ERROR 02-10 15:25:26 engine.py:366] raise_if_cache_size_invalid(num_gpu_blocks,
ERROR 02-10 15:25:26 engine.py:366] File "/usr/local/lib/python3.11/dist-packages/vllm/worker/worker.py", line 493, in raise_if_cache_size_invalid
ERROR 02-10 15:25:26 engine.py:366] raise ValueError("No available memory for the cache blocks. "
ERROR 02-10 15:25:26 engine.py:366] ValueError: No available memory for the cache blocks. Try increasing `gpu_memory_utilization` when initializing the engine.
I am running vLLM as `python3.11 -m vllm.entrypoints.openai.api_server --model /models/LLM/Meta-Llama-3-70B-Instruct/`
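For what it's worth, plugging the log's numbers into the formula that `worker.py` prints shows why this fails: the budget is capped at physical GPU memory, so the UVM-oversubscribed weights leave a negative KV-cache allocation. This is just my reading of the printed numbers, not vLLM's actual code:

```python
# Sketch of the profiler's budget arithmetic, using the values from the log above.
total_gpu_memory = 94.50        # GiB, physical device memory
gpu_memory_utilization = 0.90   # vLLM default
model_weights = 131.42          # GiB -- already exceeds physical memory via UVM
non_torch_memory = -39.47       # GiB, negative because the weights spilled past HBM
activation_peak = 1.72          # GiB, PyTorch activation peak

budget = total_gpu_memory * gpu_memory_utilization
kv_cache = budget - model_weights - non_torch_memory - activation_peak
print(f"budget   = {budget:.2f} GiB")    # 85.05 GiB
print(f"kv_cache = {kv_cache:.2f} GiB")  # -8.62 GiB -> 0 GPU blocks
```

So if I read this correctly, the oversubscription ratio helps the allocator load the weights, but the cache-block sizing is still based on physical GPU memory and therefore comes out negative.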