Skip to content
Draft
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
22 commits
Select commit Hold shift + click to select a range
053f476
feat(dflash): add DeepSeek V4 Flash backend
howard0su Jun 9, 2026
3504871
fix(deepseek4): handle u32/i32 metadata types in GGUF loader
howard0su Jun 9, 2026
c423a35
fix(deepseek4): use ggml_backend_tensor_alloc for proper buffer binding
howard0su Jun 9, 2026
f9accaf
fix(deepseek4): load all layers (fix layer_end default check)
howard0su Jun 9, 2026
731a66c
fix(deepseek4): auto-fallback to hybrid mode on GPU OOM
howard0su Jun 9, 2026
abab11e
fix(deepseek4): fix grouped output projection and attention placehold…
howard0su Jun 9, 2026
78c51f8
fix(deepseek4): disable HC pre-mix to fix reshape assertion
howard0su Jun 9, 2026
ddcfd23
fix(deepseek4): correct batched grouped output projection
howard0su Jun 9, 2026
a69c0a5
fix(deepseek4): correct compressor state dimensions
howard0su Jun 9, 2026
c92698d
debug: add layer progress prints for remote debugging
howard0su Jun 9, 2026
bebb91e
fix(deepseek4): cast APE from F16 to F32 before add
howard0su Jun 9, 2026
880495f
debug: more specific crash location prints
howard0su Jun 9, 2026
9ca201e
debug: trace MLA vs compressor crash
howard0su Jun 9, 2026
2144c7a
debug: trace inside MLA attention
howard0su Jun 9, 2026
f0b3a2f
fix(deepseek4): indexer score sum_rows axis fix
howard0su Jun 9, 2026
2bf59d0
fix(deepseek4): mark I32 position inputs for gallocr
howard0su Jun 9, 2026
32c3207
fix(deepseek4): skip RoPE in compressor/indexer (gallocr buffer issue)
howard0su Jun 9, 2026
64f72c7
chore(deepseek4): remove debug layer progress prints
howard0su Jun 9, 2026
57002a6
feat(deepseek4): implement tail RoPE, MLA attention, and compressor p…
howard0su Jun 9, 2026
14b3eaa
feat(deepseek4): implement CPU-side HC (Hierarchical Controller)
howard0su Jun 9, 2026
2291c93
fix(deepseek4): store all prefill KV rows in SWA ring buffer
howard0su Jun 9, 2026
4b0d95d
fix(deepseek4): use standard RoPE mode (sequential pairs), not NEOX
howard0su Jun 9, 2026
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
138 changes: 138 additions & 0 deletions .github/copilot-instructions.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,138 @@
# Copilot Instructions — Lucebox Hub

## What is this repo

Local LLM inference engine with hand-tuned CUDA/HIP kernels for specific consumer GPUs. Speculative decoding, speculative prefill, and fused megakernels. Reference hardware: RTX 3090 (sm_86).

### Components

- **`server/`** — DFlash: C++/CUDA speculative-decoding server. OpenAI-compatible HTTP API (`/v1/chat/completions`, `/v1/responses`, `/v1/messages`). Built with CMake on top of vendored ggml (`server/deps/llama.cpp` submodule) — no PyTorch or libllama at runtime. Supports multiple model architectures dispatched at startup via `general.architecture` in the GGUF (qwen35, qwen36, laguna, gemma4).
- **`optimizations/megakernel/`** — Fused 24-layer CUDA megakernel for Qwen 3.5-0.8B (18 DeltaNet + 6 Attention layers, single persistent dispatch). Python + CUDAExtension (`setup.py` links against torch C++ libs). Research proof-of-concept, batch-size-1 only.
- **`optimizations/pflash/`** — PFlash: speculative prefill compression. A small drafter scores token importance, then the target only prefills spans that matter. The algorithm lives in `server/` C++; this directory is the Python bench harness (NIAH case generation, daemon protocol driver).
- **`harness/`** — Client launchers and regression tests. Shell scripts that spawn `dflash_server` and run real clients (Codex, Claude Code, OpenCode, Hermes, etc.). Auto-installs client CLIs under `.harness-work/`.

## Build commands

```bash
# ── Prerequisites ──
# System deps (Ubuntu 22.04/24.04): build-essential cmake git git-lfs nvcc
sudo bash server/scripts/setup_system.sh # idempotent, configures nvcc on PATH

# ── Submodules (required before CMake) ──
git submodule update --init --recursive

# ── Python workspace (uv 0.11+ is canonical; single .venv at repo root) ──
uv sync # dflash + pflash deps (pulls torch from cu128 index)
uv sync --extra megakernel # second pass: compiles megakernel CUDA extension against the venv's torch

# ── C++/CUDA server (CUDA 12+, CMake 3.18+) ──
cmake -B server/build -S server -DCMAKE_BUILD_TYPE=Release
cmake --build server/build --target dflash_server -j

# ── Megakernel bench ──
uv run --directory megakernel python final_bench.py
```

### CMake options

| Option | Default | Notes |
|--------|---------|-------|
| `CMAKE_CUDA_ARCHITECTURES` | `75;86` (auto-extended) | Set to match your GPU. 86=3090, 89=4090, 120=5090/Spark, 110=Thor |
| `DFLASH27B_GPU_BACKEND` | `cuda` | Set to `hip` for AMD ROCm builds |
| `DFLASH27B_FA_ALL_QUANTS` | `ON` | All FA KV-quant pairs (3× longer compile; set OFF for fast iteration) |
| `DFLASH27B_ENABLE_BSA` | `ON` | Block-Sparse Attention for PFlash (requires sm_80+) |
| `DFLASH27B_TESTS` | `ON` | Build C++ test binaries |

### Key CMake targets

| Target | Purpose |
|--------|---------|
| `dflash_server` | Production HTTP server binary |
| `test_dflash` | Speculative-decode daemon binary (driven by Python scripts via stdin/stdout) |
| `test_server_unit` | C++ unit tests (run via ctest) |
| `test_vs_oracle` | Numerics correctness test (needs GPU + model files) |
| `test_generate` | Autoregressive generation correctness |
| `test_flash_attn_sparse` | Flash attention sparse kernel test |
| `test_flashprefill_kernels` | PFlash CUDA kernel tests |
| `pflash_daemon` | PFlash compression daemon binary |

### Stale build directory

If cmake was previously run without CUDA (or with different settings), wipe the build directory first (`rm -rf server/build`) to avoid a stale compiler cache.

## Test commands

```bash
# ── C++ unit tests (no GPU model files needed) ──
cd server/build && ctest --output-on-failure -R server_unit --no-tests=error

# ── C++ GPU tests (require model files in server/models/) ──
./server/build/test_vs_oracle \
--target server/models/Qwen3.6-27B-Q4_K_M.gguf \
--draft server/models/draft/dflash-draft-3.6-q4_k_m.gguf

# Smoke tests (individual GPU loads)
./server/build/smoke_load_target --target server/models/Qwen3.6-27B-Q4_K_M.gguf
./server/build/smoke_load_draft --draft server/models/draft/dflash-draft-3.6-q4_k_m.gguf

# ── Python integration tests (spawn their own server or pass --url) ──
python server/scripts/test_server_prefix_cache.py
python server/scripts/test_server_prefix_cache.py --url http://localhost:8000
python server/scripts/test_multi_turn_prefix_cache.py
python server/scripts/test_full_compress_cache.py

# ── Python tests via pytest (single file or full suite) ──
uv run pytest server/tests/test_tokenizer.py # single test file
uv run pytest server/tests/ # full suite

# ── Megakernel correctness (includes output parity check vs reference) ──
uv run --directory megakernel python bench_pp_tg.py

# ── Workspace smoke (lockfile + frozen sync + import check) ──
bash scripts/check_uv_workspace.sh

# ── Harness benchmarks against a running server ──
python3 harness/client_test_runner.py bench \
--url http://127.0.0.1:8000 --suite he,agent --n-sample 3
```

## Architecture notes

- **uv workspace**: Root `pyproject.toml` declares members `server`, `optimizations/megakernel`, `optimizations/pflash`. All share a single `.venv` at repo root. The megakernel is `no-build-isolation` — it must link against the venv's cu128 torch wheel, so install requires the two-pass flow (`uv sync` then `uv sync --extra megakernel`).
- **C++ server internals**: `dflash_server` is a standalone C++ HTTP daemon (`server/src/server/`). Core runtime in `server/src/common/` (DDTree verify, draft graphs, speculative decode loop, KV cache, layer splitting). Model-specific forward paths in `server/src/qwen35/`, `server/src/laguna/`, `server/src/gemma4/`. Python scripts in `server/scripts/` drive the daemon binary via stdin/stdout protocol or HTTP.
- **Server API surface**: OpenAI Chat Completions (`/v1/chat/completions`), OpenAI Responses (`/v1/responses` for Codex), Anthropic Messages (`/v1/messages` for Claude Code), health check (`/health`), model listing (`/v1/models`).
- **Model files**: Never committed. Live in `server/models/` (gitignored). Downloaded via `hf download`. Default: Qwen3.6-27B Q4_K_M target + Lucebox Q4_K_M GGUF draft. The target path can also be set via `DFLASH_TARGET` env var.
- **GPU arch detection**: CMake auto-detects CUDA architectures from the installed toolkit. Override via `CMAKE_CUDA_ARCHITECTURES`. Megakernel uses `MEGAKERNEL_CUDA_ARCH` env var. On Volta/Turing (sm_70/75) BF16 draft weights auto-convert to FP16 at load.
- **HIP backend**: AMD GPU support (Strix Halo, RX 7900 XTX) via `DFLASH27B_GPU_BACKEND=hip`, ROCm 6+. Compatibility layer in `server/src/hip_compat/`.
- **Environment variables**: Server behavior controlled via `DFLASH_` / `DFLASH27B_` prefixed env vars (e.g., `DFLASH27B_KV_TQ3=1` for TQ3_0 KV cache, `DFLASH_FP_USE_BSA=1` for BSA dispatch, `DFLASH_TARGET_GPU=N`). Harness launchers use `DFLASH_SERVER_BIN`, `DFLASH_TARGET`, `DFLASH_DRAFT`, `MAX_CTX`, `BUDGET`, `VERIFY_MODE`.

## Conventions

- **Commit messages**: Conventional commits — `feat(megakernel):`, `fix(dflash):`, `perf(pflash):`, `docs(hub):`. Allowed types: `feat`, `fix`, `refactor`, `perf`, `docs`, `test`, `bench`, `chore`, `ci`.
- **One concern per PR**: Kernel/algorithm changes, docs, and build config go in separate commits or PRs.
- **Benchmarks required**: Kernel/algorithm PRs must include before/after numbers on the same hardware, same power limit, same warmup. Numbers without methodology don't get merged.
- **Correctness checks**: Run `bench_pp_tg.py` (megakernel) or `test_vs_oracle` (DFlash) to confirm changes don't regress output parity.
- **Python**: 3.12 (pinned in `.python-version`). Use `uv` for dependency management (not raw pip, though legacy `pip install` flow still works for individual subprojects).
- **C++ standard**: C++17.
- **No closed-source deps**: Everything must be reproducible from public sources.
- **Power methodology**: Efficiency numbers (tok/J) measure accelerator power only via NVML, following Hazy Research's Intelligence Per Watt methodology. Default sweet spot: `sudo nvidia-smi -pl 220` on RTX 3090.

## CI

GitHub Actions on PRs to `main` (`.github/workflows/ci.yml`):

1. **`uv workspace`** — `uv lock --check`, sync without torch, import smoke test.
2. **`build`** — Full CMake build (sm_86, BSA off, FA_ALL_QUANTS off for speed), C++ unit tests via `ctest -R server_unit`, two-pass megakernel compile (sm_75 then sm_86), extension import verification.

## Running the server

```bash
# Download default models (~18 GB)
hf download unsloth/Qwen3.6-27B-GGUF Qwen3.6-27B-Q4_K_M.gguf --local-dir server/models/
hf download Lucebox/Qwen3.6-27B-DFlash-GGUF dflash-draft-3.6-q4_k_m.gguf --local-dir server/models/draft/

# Run with DDTree speculative decode
./server/build/dflash_server server/models/Qwen3.6-27B-Q4_K_M.gguf \
--draft server/models/draft/dflash-draft-3.6-q4_k_m.gguf \
--ddtree --ddtree-budget 22 --fa-window 2048 --port 8080
```
23 changes: 21 additions & 2 deletions server/CMakeLists.txt
Original file line number Diff line number Diff line change
Expand Up @@ -208,6 +208,7 @@ set(DFLASH27B_SRC_INCLUDE_DIRS
${CMAKE_CURRENT_SOURCE_DIR}/src/laguna
${CMAKE_CURRENT_SOURCE_DIR}/src/qwen3
${CMAKE_CURRENT_SOURCE_DIR}/src/gemma4
${CMAKE_CURRENT_SOURCE_DIR}/src/deepseek4
${CMAKE_CURRENT_SOURCE_DIR}/src/server
)

Expand All @@ -229,6 +230,11 @@ add_library(dflash_common STATIC
src/gemma4/gemma4_daemon.cpp
src/gemma4/gemma4_dflash_target.cpp
src/gemma4/gemma4_layer_split_adapter.cpp
# DeepSeek V4 Flash target arch
src/deepseek4/deepseek4_loader.cpp
src/deepseek4/deepseek4_graph.cpp
src/deepseek4/deepseek4_backend.cpp
src/deepseek4/deepseek4_daemon.cpp
src/flashprefill_q8.cpp
src/kv_cache.cpp
src/kv_quant.cpp
Expand Down Expand Up @@ -532,8 +538,10 @@ find_package(OpenMP)
if(OpenMP_CXX_FOUND)
target_link_libraries(dflash_common PRIVATE OpenMP::OpenMP_CXX)
endif()
if(DFLASH27B_GPU_BACKEND STREQUAL "hip")
target_link_libraries(dflash_common PRIVATE hip::host)
if(DFLASH27B_GPU_BACKEND STREQUAL "cuda")
target_link_libraries(dflash_common PUBLIC CUDA::cudart)
elseif(DFLASH27B_GPU_BACKEND STREQUAL "hip")
target_link_libraries(dflash_common PUBLIC hip::host)
endif()

if (CMAKE_CXX_COMPILER_ID STREQUAL "GNU" OR CMAKE_CXX_COMPILER_ID MATCHES "Clang")
Expand All @@ -552,6 +560,11 @@ if(EXISTS "${CMAKE_CURRENT_SOURCE_DIR}/test/pflash_daemon.cpp")
add_executable(pflash_daemon test/pflash_daemon.cpp)
target_include_directories(pflash_daemon PRIVATE ${DFLASH27B_SRC_INCLUDE_DIRS})
target_link_libraries(pflash_daemon PRIVATE dflash_common ggml ${DFLASH27B_GGML_BACKEND_TARGET})
if(DFLASH27B_GPU_BACKEND STREQUAL "cuda")
target_link_libraries(pflash_daemon PRIVATE CUDA::cudart)
else()
target_link_libraries(pflash_daemon PRIVATE hip::host)
endif()
endif()

# ─── Tests (numerics vs oracle) ────────────────────────────────────
Expand Down Expand Up @@ -614,6 +627,12 @@ if(DFLASH27B_TESTS)
endif()
target_link_libraries(test_qwen35moe_swap_manager PRIVATE dflash_common)
endif()
if(EXISTS "${CMAKE_CURRENT_SOURCE_DIR}/test/test_deepseek4_unit.cpp")
add_executable(test_deepseek4_unit test/test_deepseek4_unit.cpp)
target_include_directories(test_deepseek4_unit PRIVATE ${DFLASH27B_SRC_INCLUDE_DIRS} ${CMAKE_CURRENT_SOURCE_DIR}/include)
target_link_libraries(test_deepseek4_unit PRIVATE ggml ggml-cpu)
add_test(NAME deepseek4_unit COMMAND test_deepseek4_unit)
endif()
if(EXISTS "${CMAKE_CURRENT_SOURCE_DIR}/test/smoke_load_draft.cpp")
add_executable(smoke_load_draft test/smoke_load_draft.cpp)
target_include_directories(smoke_load_draft PRIVATE ${DFLASH27B_SRC_INCLUDE_DIRS})
Expand Down
16 changes: 16 additions & 0 deletions server/src/common/backend_factory.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -10,6 +10,7 @@
#include "qwen3_backend.h"
#include "gemma4_backend.h"
#include "gemma4_layer_split_adapter.h"
#include "deepseek4_backend.h"
#include "layer_split_backend.h"
#include "qwen35_layer_split_adapter.h"

Expand Down Expand Up @@ -202,6 +203,21 @@ std::unique_ptr<ModelBackend> create_backend(const BackendArgs & args) {
}
return backend;

} else if (arch == "deepseek4") {
DeepSeek4BackendConfig cfg;
cfg.model_path = args.model_path;
cfg.device = args.device;
cfg.stream_fd = args.stream_fd;
cfg.max_ctx = args.device.max_ctx;
cfg.chunk = args.chunk;

auto backend = std::make_unique<DeepSeek4Backend>(cfg);
if (!backend->init()) {
std::fprintf(stderr, "[backend_factory] DeepSeek4Backend init failed\n");
return nullptr;
}
return backend;

} else {
std::fprintf(stderr, "[backend_factory] unsupported architecture: %s\n",
arch.c_str());
Expand Down
Loading