Configs, launchers, benchmarks, and tooling for running Qwen3.5 GGUF models locally with llama.cpp on a 16GB NVIDIA GPU.
Python · Updated Apr 18, 2026
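A minimal sketch of what a launcher in this spirit could look like, assuming `llama-server` from llama.cpp is on `PATH`. The model filename, context size, and layer count below are hypothetical placeholders; `-m`, `-ngl`, `-c`, and `--port` are real llama.cpp flags.

```python
import subprocess

# Hypothetical model path and tuning values -- adjust for your setup.
MODEL = "models/Qwen3.5-14B-Instruct-Q4_K_M.gguf"
CTX = 16384       # context window; a larger context grows the KV cache
GPU_LAYERS = 99   # offload all layers; a ~4-bit quant of this size fits in 16GB

def launch_server(port: int = 8080) -> subprocess.Popen:
    """Start llama-server with flags suited to a single 16GB GPU."""
    cmd = [
        "llama-server",
        "-m", MODEL,
        "-ngl", str(GPU_LAYERS),
        "-c", str(CTX),
        "--port", str(port),
    ]
    return subprocess.Popen(cmd)

if __name__ == "__main__":
    launch_server().wait()
```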
RTX 5090 & RTX 5060 Docker container with PyTorch + TensorFlow. First fully-tested Blackwell GPU support for ML/AI. CUDA 12.8, Python 3.11, Ubuntu 24.04. Works with RTX 50-series (5090/5080/5070/5060) and RTX 40-series.
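A quick sanity check one might run inside such a container to confirm the GPU stack is wired up, using only standard PyTorch calls. Consumer Blackwell parts report compute capability 12.0 (sm_120); RTX 40-series (Ada) reports sm_89.

```python
import torch

# Verify the container actually sees the GPU and report its architecture.
assert torch.cuda.is_available(), "CUDA not visible -- did you pass --gpus all?"
major, minor = torch.cuda.get_device_capability(0)
print(f"{torch.cuda.get_device_name(0)}: sm_{major}{minor}")
if (major, minor) != (12, 0):
    print("note: not a Blackwell sm_120 device (RTX 40-series reports sm_89)")
```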
High-performance LLM inference engine in C++/CUDA for NVIDIA Blackwell GeForce / RTX PRO (RTX 5090/5080/5070 Ti, RTX PRO 6000; sm_120). 200 tok/s decode on Qwen3.6-35B-A3B-NVFP4 MoE (RTX 5090).
Provides tested tools and configs to run Qwen3.5 GGUF models locally and efficiently on a single 16GB NVIDIA GPU with llama.cpp.
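Whether a given quant fits in 16GB comes down to simple arithmetic: weight bytes plus KV cache plus headroom for compute buffers. A back-of-the-envelope sketch, where all architecture numbers (layer count, KV heads, head size) are hypothetical stand-ins for a 14B-class dense model:

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   ctx: int, bytes_per_elt: int = 2) -> int:
    """Keys + values for every layer at full context (f16 cache by default)."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per_elt

# Hypothetical 14B-class dense model at ~4.5 bits/weight (Q4_K_M-like):
weights_gib = 14e9 * 4.5 / 8 / 2**30
kv_gib = kv_cache_bytes(n_layers=40, n_kv_heads=8, head_dim=128, ctx=16384) / 2**30
print(f"weights ~{weights_gib:.1f} GiB + KV ~{kv_gib:.1f} GiB vs 16 GiB VRAM")
# -> roughly 7.3 + 2.5 GiB, leaving headroom for activations and buffers
```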
Downstream llama.cpp TurboQuant CUDA fork with adaptive KV layout selection for long-context inference on consumer Blackwell GPUs.
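The fork's actual selection logic isn't documented here, but the idea can be illustrated with a toy heuristic that picks a KV cache quantization from the target context length and feeds llama.cpp's real `--cache-type-k`/`--cache-type-v` options. The thresholds are made up, and note that quantized V caches in llama.cpp generally require flash attention to be enabled.

```python
def pick_kv_cache_type(ctx: int) -> str:
    """Toy stand-in for adaptive KV layout selection: trade KV precision
    for context length so long contexts still fit in 16GB of VRAM."""
    if ctx <= 8192:
        return "f16"   # full-precision KV is affordable at short contexts
    if ctx <= 32768:
        return "q8_0"  # roughly halves KV memory at minor quality cost
    return "q4_0"      # aggressive compression for very long contexts

ct = pick_kv_cache_type(65536)
print(["--cache-type-k", ct, "--cache-type-v", ct])
```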
Pre-built onnxruntime-gpu 1.24.1 with Blackwell sm_120 CUDA kernels (RTX 5090/5080/5070)
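Consuming such a wheel is standard onnxruntime usage; a small check that the CUDA execution provider actually loads (the model path is a hypothetical placeholder):

```python
import onnxruntime as ort

print(ort.get_available_providers())  # should include "CUDAExecutionProvider"

# Fall back to CPU if CUDA initialization fails rather than erroring out.
sess = ort.InferenceSession(
    "model.onnx",  # hypothetical path to an exported ONNX model
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
print(sess.get_providers())  # providers actually bound to this session
```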