
Implement context-length dependent KV-cache and Compute Buffer aware …#335

Merged
Nexesenex merged 1 commit into
Nexesenex:lcpp_pr_kv_aware_layer_distrib
from borebot:kv-compute-buffer-cache-aware-allocation
Jul 4, 2025

@Nexesenex

Owner

…layer distribution for heterogeneous multi-GPU inference. Solves the problem of running setups that mix GPUs with different amounts of VRAM (e.g. 24GB cards with 6GB cards). Previously, layers were assigned without accounting for the compute buffer, which caused failures when one or more of the smaller GPUs could not hold it.

  • Add a requested_n_ctx parameter to llama_model_params
  • Implement a 3-pass allocation algorithm that accounts for compute buffers
  • Exclude devices with insufficient memory (GPUs too small to hold 1 layer + KV cache + compute buffer)
  • Redistribute layers to make equitable use of the included GPUs (may not be truly optimal)
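
For readers who want a feel for the idea, here is a minimal, self-contained sketch of a three-pass split. It is not the code from this PR: `DeviceInfo`, `kv_bytes_per_layer` and `distribute_layers` are hypothetical names, the way the work is divided into passes is an assumption, and the compute buffer size is taken as a given input rather than measured. The KV-cache helper uses the usual per-layer estimate for standard attention (K and V each store n_ctx * n_kv_heads * head_dim elements), which is where the requested context length enters.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical description of one GPU; not a llama.cpp type.
struct DeviceInfo {
    std::uint64_t free_bytes;            // free VRAM on the device
    std::uint64_t compute_buffer_bytes;  // assumed compute buffer size at the requested context
};

// KV-cache bytes for one layer: K and V each hold n_ctx * n_kv_heads * head_dim elements.
std::uint64_t kv_bytes_per_layer(std::uint64_t n_ctx, std::uint64_t n_kv_heads,
                                 std::uint64_t head_dim, std::uint64_t bytes_per_elem) {
    return 2 * n_ctx * n_kv_heads * head_dim * bytes_per_elem;
}

// Returns the number of layers given to each device (0 for excluded devices).
// Layers that fit on no device are left unassigned.
std::vector<int> distribute_layers(const std::vector<DeviceInfo>& devices, int n_layers,
                                   std::uint64_t layer_bytes, std::uint64_t kv_layer_bytes) {
    assert(n_layers >= 0);
    const std::uint64_t per_layer = layer_bytes + kv_layer_bytes;
    assert(per_layer > 0);

    std::vector<int> capacity(devices.size(), 0);
    std::vector<int> assigned(devices.size(), 0);

    // Pass 1: exclude devices that cannot hold the compute buffer plus one
    // layer with its KV cache; record how many layers the others can hold.
    std::int64_t total_capacity = 0;
    for (std::size_t i = 0; i < devices.size(); ++i) {
        const DeviceInfo& d = devices[i];
        if (d.free_bytes < d.compute_buffer_bytes + per_layer) continue;  // excluded
        const std::uint64_t fit = (d.free_bytes - d.compute_buffer_bytes) / per_layer;
        capacity[i] = static_cast<int>(
            std::min<std::uint64_t>(fit, static_cast<std::uint64_t>(n_layers)));
        total_capacity += capacity[i];
    }
    if (total_capacity == 0) return assigned;

    // Pass 2: give each included device a share proportional to its capacity.
    int remaining = n_layers;
    for (std::size_t i = 0; i < devices.size(); ++i) {
        const std::int64_t share =
            static_cast<std::int64_t>(n_layers) * capacity[i] / total_capacity;
        assigned[i] = static_cast<int>(std::min<std::int64_t>(share, capacity[i]));
        remaining -= assigned[i];
    }

    // Pass 3: redistribute layers left over from rounding, one at a time,
    // to the device with the most spare capacity.
    while (remaining > 0) {
        int best = -1;
        for (std::size_t i = 0; i < devices.size(); ++i) {
            const int spare = capacity[i] - assigned[i];
            if (spare > 0 && (best < 0 || spare > capacity[best] - assigned[best])) {
                best = static_cast<int>(i);
            }
        }
        if (best < 0) break;  // nothing fits anywhere
        ++assigned[best];
        --remaining;
    }
    return assigned;
}
```

In this sketch, a device whose free memory is entirely taken up by the compute buffer receives no layers at all instead of causing a failure later, which is the behaviour the exclusion bullet describes.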

@Nexesenex Nexesenex merged commit ccb3e6c into Nexesenex:lcpp_pr_kv_aware_layer_distrib Jul 4, 2025
46 of 47 checks passed
2 participants