From c955f57f95295d0feffb203f2fcf6a4e13eb5635 Mon Sep 17 00:00:00 2001 From: Claude Date: Thu, 11 Jun 2026 01:13:18 +0000 Subject: [PATCH 1/9] docs: add June 2026 codebase review with prioritized findings Covers security posture, performance hot paths, correctness, cleanup, test gaps, and an assessment of Apple's Core AI framework relative to the MLX-based architecture. https://claude.ai/code/session_01Rko86UGb3GKp84joEs5E5c --- docs/codebase-review-2026-06.md | 177 ++++++++++++++++++++++++++++++++ 1 file changed, 177 insertions(+) create mode 100644 docs/codebase-review-2026-06.md diff --git a/docs/codebase-review-2026-06.md b/docs/codebase-review-2026-06.md new file mode 100644 index 00000000..0e01e561 --- /dev/null +++ b/docs/codebase-review-2026-06.md @@ -0,0 +1,177 @@ +# Codebase Review — June 2026 + +A full review of the workspace (`higgs`, `higgs-engine`, `higgs-models`, `higgs-bench`) +covering security, performance, correctness, cleanup, and an assessment of Apple's +Core AI framework. File/line references are against the tree at the time of review; +items marked **[fixed]** were addressed on the branch that introduced this document. + +## Overall assessment + +The codebase is in good shape. The strict lint policy in `Cargo.toml` is genuinely +enforced — production code is free of `unwrap`/`panic` (all occurrences are +`#[cfg(test)]`-guarded), custom Metal kernels are cached correctly in `OnceLock` +statics, README and docs match actual behavior, and CI covers fmt, clippy +(pedantic + nursery), tests, MSRV, and a 70% coverage floor. + +The significant findings cluster in three areas: + +1. An O(n²) streaming decode path in the engine. +2. Insecure-by-default network posture and world-readable secrets. +3. Duplication: two prefix-cache implementations, duplicated prefill logic, and + near-duplicate route handlers. + +## Apple Core AI assessment + +[Core AI](https://developer.apple.com/documentation/coreai/) is Apple's WWDC 2026 +on-device inference framework — the runtime behind Apple Intelligence, now public. +It targets CPU, GPU, and the **Apple Neural Engine (ANE)**, with ahead-of-time model +compilation for instant load, stateful (KV-cache-style) execution, zero-copy data +paths, and PyTorch conversion tooling. + +**It is Swift-only today.** There is no documented C/C++ API surface, and therefore +no path to call it from Rust without a hand-rolled Swift shim. Recommendation: + +- **Keep MLX (mlx-rs) as the backbone.** It is the right tool for a Rust inference + server on Apple Silicon, and higgs already exploits it well (custom Metal kernels + for GatedDeltaNet and TurboQuant, `mlx_rs::fast::*` SDPA/RoPE paths). +- **Future opportunity — ANE offload for auxiliary models.** A small Swift + shim/XPC sidecar could run compact models on the ANE while the GPU stays + dedicated to the main model. Best candidates: speculative-decode draft models + (`higgs-engine/src/mtp.rs`), embedding models, and the SigLIP vision encoder + (`higgs-models/src/siglip.rs`). This is R&D, not a near-term refactor. +- **Cold start.** Core AI's AOT-compiled model assets could eventually reduce model + load times if a bridge materializes. +- **Watch for** a C API or community Rust bindings before committing to anything. + +## P0 — Security + +1. **Insecure network defaults.** The server defaulted to `0.0.0.0` + (`crates/higgs/src/config.rs`, `default_host()`), with no API key required and + `CorsLayer::permissive()` unconditionally applied (`crates/higgs/src/lib.rs`). + Any host on the LAN could use the server. **[fixed]** Default bind is now + `127.0.0.1`, CORS is configurable via `server.cors_origins` (disabled unless + set; `"*"` opts into permissive), and `higgs doctor` warns when binding a + non-loopback address without an API key. +2. **World-readable secrets.** Config files containing provider API keys were + written with default permissions (0o644) by `higgs init` + (`crates/higgs/src/daemon.rs`) and `higgs config set` + (`crates/higgs/src/cli_config.rs`). **[fixed]** Config files are now written + 0o600 on Unix, and doctor warns when an existing config containing an + `api_key` is group/world-readable. +3. **Upstream error bodies leaked into errors/metrics.** + `crates/higgs/src/proxy.rs` embedded the entire upstream error body in + `ProxyError`, which flows into logs and the metrics store; upstream bodies can + echo request headers/keys. **[fixed]** Bodies are truncated to a bounded length. +4. **Chat template execution surface.** `higgs-engine/src/chat_template.rs` + registers all of `minijinja_contrib` plus the pycompat method callback for + templates loaded from model directories (`chat_template.jinja` / + `tokenizer_config.json`). A malicious model repo therefore gets a fairly rich + template language. minijinja has no filesystem/process access so this is + bounded, but unbounded loops are possible. **[fixed]** Template execution is + now fuel-limited via `Environment::set_fuel`. Follow-up (not done): trim the + contrib filter set to what HF templates actually use. +5. **Missing resource bounds** (recommendation, not implemented): + - No configurable prompt-length limit before tokenization/prefill. + - Constrained decoding compiles schema-derived structures without a size cap + (`higgs-engine/src/constrained.rs`); a pathological JSON schema is a DoS vector. + - The engine `sessions` map (`higgs-engine/src/simple.rs`) has no cap or TTL. + Suggested: `max_prompt_tokens` and `max_schema_bytes` config fields plus a + session cap, all validated by doctor. + +## P1 — Performance + +6. **O(n²) streaming decode.** Each generated token triggered + `decode_tokens(tokens)` over the *entire* completion so far, then sliced off + the new suffix; with stop sequences enabled, `check_stop_sequences` also + rescanned the full text every step (`higgs-engine/src/simple.rs`, + streaming loop). Cost grows quadratically with completion length — measurable + on long generations. The batch engine had the identical pattern in + `batch_engine.rs`. **[fixed]** Both engines now share an incremental + detokenizer that decodes a bounded trailing window of tokens per step + (UTF-8 sequences split across tokens are held back until complete, which + also fixes replacement-char corruption in streamed output), and stop + sequences are scanned over only the new tail plus the maximum + stop-sequence overlap. The non-streaming path still re-decodes per token + when stop sequences are configured — same recipe applies if it shows up + in profiles. +7. **Prefix cache clones — corrected, no change needed.** An initial pass + flagged `PrefixCache::find_longest_prefix` for cloning the entire cached + `AnyCache` per lookup (`higgs-engine/src/prompt_cache.rs`). Inspection of + mlx-rs shows `Array::clone` is a refcounted handle copy (`mlx_array_set`), + so an `AnyCache` clone is O(layers) handle bumps, not a tensor copy — and + consumers genuinely need an owned copy because the forward pass mutates it. + The remaining open question (macOS-only to verify): a cached handle keeps + the underlying buffers alive, which can prevent MLX buffer donation and + force copy-on-write during subsequent decode steps. Worth profiling before + any restructuring. +8. **O(n) LRU eviction.** Eviction walks the whole radix tree to find the oldest + entry (`prompt_cache.rs::evict_lru`; similar in `paged_prefix_cache.rs`). + Acceptable at current cache sizes now that per-lookup clones are gone; revisit + with an access-ordered index if `max_entries` grows. +9. **TurboQuant dtype conversions — corrected, no change needed.** The + unconditional `as_dtype(Dtype::Float32)` calls in + `higgs-models/src/turboquant.rs` looked like redundant materializations, + but MLX core's `astype` short-circuits to a same-handle return when the + dtype already matches (`mlx/ops.cpp: if (dtype == a.dtype()) return a;`), + so these are no-ops beyond an FFI call. +10. Lower priority (not implemented): per-request `messages.clone()` in route + handlers (`crates/higgs/src/routes/chat.rs`); `Mutex` rather than `RwLock` + for read-heavy engine state (`simple.rs`); metrics store uses two `RwLock`s + per request (fine at local-server request rates). + +## P2 — Correctness + +11. **Two prefix-cache implementations.** `prompt_cache.rs` (~670 LOC, + clone-based radix tree) and `paged_prefix_cache.rs` (~1,070 LOC, block-paged) + coexist; the simple engine uses the paged one while the batch engine uses the + old one. Converge on the paged cache and delete the old implementation. +12. **Client disconnect cleanup.** The batch engine only notices a disconnected + client when `blocking_send` fails (`batch_engine.rs`), leaving in-flight + request state to be cleaned up late; on the HTTP side, a stream that ends + early can leave the metrics record without final token counts + (`routes/chat.rs` streaming finalization). Add cancellation-aware cleanup and + mark such records as cancelled. +13. **Error swallowing.** `cache/paged.rs::remove_session` logs only the first + block-free error; weight loading warns (rather than errors) on unmatched + keys (`higgs-models/src/lib.rs`), which is lenient enough to hide a corrupt + or mismatched checkpoint. Consider a strict-loading flag. +14. **PID-file TOCTTOU.** `read_pid → pid_is_alive → kill` in `daemon.rs` is + non-atomic. Low impact (stale-PID cleanup exists); noted for completeness. +15. **Unimplemented session APIs.** `step()`/`generate_session()` in `simple.rs` + return "not implemented" — either finish batched session generation or remove + the scaffolding so the API surface matches reality. + +## P3 — Cleanup / maintainability + +- **Duplicated prefill/decode-graph logic** between `simple.rs` and + `batch_engine.rs` (~100 LOC overlap each for prefill and decode-graph + construction). Extract shared helpers. +- **`generate_inner()` is ~1,100 lines** (`simple.rs`) covering standard decode, + MTP, prompt-lookup, and thinking-budget paths. Split per strategy. +- **Route handler duplication.** `routes/chat.rs` and `routes/anthropic.rs` + share the parse → route → stream/non-stream → metrics skeleton; several + handlers run 300-950 lines. A shared streaming-response helper would remove + most of it. +- **`qwen3_next.rs` is ~15k lines** — inline Metal kernel codegen, GDN layers, + MoE, and tests in one file. Split into submodules. +- **Attention/mask logic duplicated** across `gemma2.rs`, `phi3.rs`, + `starcoder2.rs`, `deepseek_v2.rs` despite shared helpers existing in + `utils.rs` (which `transformer.rs` already uses for Llama/Qwen2/Mistral). +- **Hardcoded `THINKING_BUDGET = 256`** (`simple.rs`) — should be a config field + (with doctor validation and README/init-template updates per project rules). +- **mlx-rs git pin** blocks crates.io publishing (release workflow skips publish + when git-pinned). Track upstream releases. +- **Ignored test due to global MLX state** (`higgs-models/src/yarn.rs`) — known + Metal/RNG state contamination across tests in one process; documented, but + worth revisiting if more tests start interfering. + +## Test gaps + +- Streaming SSE end-to-end over HTTP (chunk framing, `[DONE]`). +- Client-disconnect/cancellation paths (engine cleanup, metrics finalization). +- Cache exhaustion: prefix-cache eviction under pressure, paged-cache block + exhaustion, multi-session contention. +- Adversarial inputs: hostile chat templates, pathological constrained-decoding + schemas, very long prompts. +- Session lifecycle (create → generate → remove) and leak detection. +- Concurrent-request stress (rate limiter, lock contention). From 1b41a5c93696a7492ef4bf7e941982341edf50ab Mon Sep 17 00:00:00 2001 From: Claude Date: Thu, 11 Jun 2026 01:13:27 +0000 Subject: [PATCH 2/9] feat: harden default security posture - Bind 127.0.0.1 by default instead of 0.0.0.0; doctor warns when a non-loopback bind has no api_key set - Make CORS opt-in via server.cors_origins (unset = no CORS headers, ["*"] = permissive, otherwise an explicit origin allow-list) instead of unconditionally permissive; doctor validates entries - Write config and PID files with 0o600 permissions; doctor warns when an existing config containing API keys is group/world-accessible - Truncate upstream error bodies before embedding them in proxy errors and metrics, so upstream responses cannot leak echoed credentials - Fuel-limit chat template execution (templates load from third-party model directories) https://claude.ai/code/session_01Rko86UGb3GKp84joEs5E5c --- Cargo.toml | 2 +- README.md | 2 + crates/higgs-engine/src/chat_template.rs | 9 + crates/higgs/src/cli_config.rs | 2 +- crates/higgs/src/config.rs | 39 +++- crates/higgs/src/daemon.rs | 10 +- crates/higgs/src/doctor.rs | 178 +++++++++++++++++- crates/higgs/src/lib.rs | 84 ++++++++- crates/higgs/src/main.rs | 19 +- crates/higgs/src/proxy.rs | 41 ++++ .../higgs/tests/integration/api_contract.rs | 4 +- crates/higgs/tests/integration/proxy_e2e.rs | 2 +- docs/configuration.md | 9 +- 13 files changed, 376 insertions(+), 25 deletions(-) diff --git a/Cargo.toml b/Cargo.toml index d64fa01f..4e883fda 100644 --- a/Cargo.toml +++ b/Cargo.toml @@ -96,7 +96,7 @@ json5 = "1" # Tokenization + Templates tokenizers = { version = "0.23", features = ["http"] } -minijinja = { version = "2", features = ["loader"] } +minijinja = { version = "2", features = ["loader", "fuel"] } minijinja-contrib = { version = "2", features = ["pycompat"] } # CLI + Configuration diff --git a/README.md b/README.md index f919a833..cb896738 100644 --- a/README.md +++ b/README.md @@ -27,6 +27,8 @@ Higgs is a single static Rust binary that serves local models, proxies to provid - Exact local model names now beat regex routes. - `/metrics` is a real endpoint, and `server.max_body_size` is enforced on API requests. - `higgs shellenv` and `higgs exec` now fail fast on bad config or an unreachable server. +- The server now binds `127.0.0.1` by default (was `0.0.0.0`). Set `server.host = "0.0.0.0"` (and an `api_key`) to expose it on the network. +- CORS headers are no longer sent unless `server.cors_origins` is set (`["*"]` restores the old permissive behavior). ## Quick Links diff --git a/crates/higgs-engine/src/chat_template.rs b/crates/higgs-engine/src/chat_template.rs index b3f27b23..655ed055 100644 --- a/crates/higgs-engine/src/chat_template.rs +++ b/crates/higgs-engine/src/chat_template.rs @@ -12,6 +12,11 @@ pub struct ChatMessage { pub tool_calls: Option>, } +/// Upper bound on template-engine instructions per render. Generous enough +/// for complex HF chat templates over long conversations, but stops a +/// malicious template from looping forever. +const TEMPLATE_FUEL: u64 = 5_000_000; + /// Renders chat messages using a Jinja2 template (`HuggingFace` format). pub struct ChatTemplateRenderer { env: Environment<'static>, @@ -24,6 +29,10 @@ impl ChatTemplateRenderer { /// Create a renderer from a Jinja2 template string. pub fn new>(template_source: S) -> Result { let mut env = Environment::new(); + // Templates come from model directories (tokenizer_config.json / + // chat_template.jinja), which are third-party content; bound execution + // so a hostile template cannot loop forever. + env.set_fuel(Some(TEMPLATE_FUEL)); env.add_filter("tojson", tojson_filter); minijinja_contrib::add_to_environment(&mut env); env.set_unknown_method_callback(minijinja_contrib::pycompat::unknown_method_callback); diff --git a/crates/higgs/src/cli_config.rs b/crates/higgs/src/cli_config.rs index 362146a6..f3283ce2 100644 --- a/crates/higgs/src/cli_config.rs +++ b/crates/higgs/src/cli_config.rs @@ -71,7 +71,7 @@ pub fn config_set(config_path: &Path, key: &str, value: &str) { std::process::exit(1); }); } - fs::write(config_path, &rendered).unwrap_or_else(|e| { + config::write_private_file(config_path, &rendered).unwrap_or_else(|e| { eprintln!("failed to write {}: {e}", config_path.display()); std::process::exit(1); }); diff --git a/crates/higgs/src/config.rs b/crates/higgs/src/config.rs index b9967aa4..7d1fc7f3 100644 --- a/crates/higgs/src/config.rs +++ b/crates/higgs/src/config.rs @@ -273,6 +273,9 @@ pub struct ServerSection { pub timeout: f64, #[serde(default = "default_max_body_size")] pub max_body_size: usize, + /// CORS allow-list of origins. Unset = no CORS headers are sent; + /// `["*"]` allows any origin (permissive). + pub cors_origins: Option>, } impl Default for ServerSection { @@ -285,12 +288,13 @@ impl Default for ServerSection { rate_limit: 0, timeout: default_timeout(), max_body_size: default_max_body_size(), + cors_origins: None, } } } fn default_host() -> String { - "0.0.0.0".to_owned() + "127.0.0.1".to_owned() } const fn default_port() -> u16 { @@ -957,6 +961,24 @@ pub fn default_config_path() -> PathBuf { config_dir().join("config.toml") } +/// Write a file with owner-only permissions (0o600 on Unix). Used for config +/// files (which may contain provider API keys) and other daemon-private +/// files. The mode is applied at creation; existing files keep their +/// permissions (doctor warns about loose ones). +pub fn write_private_file(path: &Path, contents: &str) -> std::io::Result<()> { + use std::io::Write as _; + + let mut options = std::fs::OpenOptions::new(); + options.write(true).create(true).truncate(true); + #[cfg(unix)] + { + use std::os::unix::fs::OpenOptionsExt as _; + options.mode(0o600); + } + let mut file = options.open(path)?; + file.write_all(contents.as_bytes()) +} + /// Validates that a profile name is safe for use in file paths. pub fn validate_profile_name(name: &str) -> Result<(), String> { if name.is_empty() { @@ -1057,7 +1079,7 @@ mod tests { assert!(config.models.is_empty()); assert!(config.providers.is_empty()); assert!(config.routes.is_empty()); - assert_eq!(config.server.host, "0.0.0.0"); + assert_eq!(config.server.host, "127.0.0.1"); assert_eq!(config.server.port, 8000); assert_eq!(config.server.max_tokens, 32768); assert!((config.server.timeout - 300.0).abs() < f64::EPSILON); @@ -1071,6 +1093,19 @@ mod tests { assert_eq!(config.default.provider, "higgs"); } + #[test] + #[cfg(unix)] + fn test_write_private_file_owner_only_permissions() { + use std::os::unix::fs::PermissionsExt as _; + let dir = std::env::temp_dir().join(format!("higgs-cfg-test-{}", std::process::id())); + std::fs::create_dir_all(&dir).unwrap(); + let path = dir.join("config.toml"); + write_private_file(&path, "[server]\n").unwrap(); + let mode = std::fs::metadata(&path).unwrap().permissions().mode() & 0o777; + assert_eq!(mode, 0o600); + let _ = std::fs::remove_dir_all(&dir); + } + #[test] fn test_simple_mode_builds_models() { let args = ServeArgs { diff --git a/crates/higgs/src/daemon.rs b/crates/higgs/src/daemon.rs index 244c4a63..f980c50e 100644 --- a/crates/higgs/src/daemon.rs +++ b/crates/higgs/src/daemon.rs @@ -47,7 +47,7 @@ pub fn remove_pid_file(profile: Option<&str>) { pub fn write_pid_file(profile: Option<&str>) { let pid = std::process::id(); - if let Err(e) = fs::write(pid_path(profile), pid.to_string()) { + if let Err(e) = config::write_private_file(&pid_path(profile), &pid.to_string()) { tracing::warn!("failed to write pid file: {e}"); } } @@ -128,12 +128,16 @@ pub fn cmd_init(profile: Option<&str>) { let default_config = format!( r#"[server] -host = "0.0.0.0" +# Bind to loopback by default. To expose on the network, set host = "0.0.0.0" +# and set an api_key. +host = "127.0.0.1" port = 8000 # max_tokens = 32768 # timeout = 300.0 # api_key = "sk-..." # rate_limit = 0 +# CORS origin allow-list for browser clients. Unset = no CORS headers. +# cors_origins = ["*"] # --- Local serving defaults --- # MLX profile applies to simple-engine local models. "auto" picks balanced for @@ -216,7 +220,7 @@ provider = "higgs" "# ); - if let Err(e) = fs::write(&path, default_config) { + if let Err(e) = config::write_private_file(&path, &default_config) { eprintln!("failed to write {}: {e}", path.display()); std::process::exit(1); } diff --git a/crates/higgs/src/doctor.rs b/crates/higgs/src/doctor.rs index e66f8c40..b570681e 100644 --- a/crates/higgs/src/doctor.rs +++ b/crates/higgs/src/doctor.rs @@ -31,7 +31,10 @@ fn fail(msg: &str, result: &mut DoctorResult) { } #[allow(clippy::print_stderr)] -pub async fn run_doctor(config: &HiggsConfig) -> DoctorResult { +pub async fn run_doctor( + config: &HiggsConfig, + config_path: Option<&std::path::Path>, +) -> DoctorResult { let mut result = DoctorResult { passes: 0, warnings: 0, @@ -41,6 +44,7 @@ pub async fn run_doctor(config: &HiggsConfig) -> DoctorResult { eprintln!("\x1b[1mhiggs doctor\x1b[0m\n"); check_config_valid(&mut result); + check_config_file_permissions(config, config_path, &mut result); check_server_section(config, &mut result); check_models(config, &mut result); check_duplicate_models(config, &mut result); @@ -134,6 +138,23 @@ fn check_server_section(config: &crate::config::HiggsConfig, result: &mut Doctor ); } + let non_loopback = matches!( + server.host.parse::(), + Ok(ip) if !ip.is_loopback() + ); + if non_loopback && server.api_key.is_none() { + warn( + &format!( + "server.host=\"{}\" is reachable from the network but server.api_key is unset; \ + anyone on the network can use this server", + server.host + ), + result, + ); + } + + check_cors_origins(server, non_loopback, result); + if server.rate_limit == 0 { pass("server.rate_limit=0 (disabled)", result); } else { @@ -144,6 +165,96 @@ fn check_server_section(config: &crate::config::HiggsConfig, result: &mut Doctor } } +fn check_cors_origins( + server: &crate::config::ServerSection, + non_loopback: bool, + result: &mut DoctorResult, +) { + match &server.cors_origins { + None => pass("server.cors_origins unset; no CORS headers sent", result), + Some(origins) if origins.iter().any(|o| o == "*") => { + if non_loopback { + warn( + "server.cors_origins allows any origin (\"*\") on a network-reachable host; \ + consider an explicit origin list", + result, + ); + } else { + pass("server.cors_origins=[\"*\"] (permissive)", result); + } + } + Some(origins) => { + let mut all_valid = true; + for origin in origins { + let parses = origin.parse::().is_ok(); + if !parses || !(origin.starts_with("http://") || origin.starts_with("https://")) { + fail( + &format!( + "server.cors_origins entry \"{origin}\" is not a valid origin \ + (expected e.g. \"https://example.com\")" + ), + result, + ); + all_valid = false; + } + } + if all_valid { + pass( + &format!("server.cors_origins lists {} origin(s)", origins.len()), + result, + ); + } + } + } +} + +/// Warn when the config file holding API keys is readable by other users. +#[cfg(unix)] +fn check_config_file_permissions( + config: &HiggsConfig, + config_path: Option<&std::path::Path>, + result: &mut DoctorResult, +) { + use std::os::unix::fs::PermissionsExt as _; + + let Some(path) = config_path else { return }; + let Ok(metadata) = std::fs::metadata(path) else { + return; + }; + let mode = metadata.permissions().mode() & 0o777; + let has_secrets = + config.server.api_key.is_some() || config.providers.values().any(|p| p.api_key.is_some()); + if (mode & 0o077) == 0 { + pass( + &format!("config file permissions are owner-only ({mode:03o})"), + result, + ); + } else if has_secrets { + warn( + &format!( + "config file {} is group/world-accessible (mode {mode:03o}) and contains API \ + keys; run: chmod 600 {}", + path.display(), + path.display() + ), + result, + ); + } else { + pass( + &format!("config file permissions {mode:03o} (no API keys present)"), + result, + ); + } +} + +#[cfg(not(unix))] +fn check_config_file_permissions( + _config: &HiggsConfig, + _config_path: Option<&std::path::Path>, + _result: &mut DoctorResult, +) { +} + fn model_label(model: &crate::config::ModelConfig) -> String { model.name.as_ref().map_or_else( || model.path.clone(), @@ -896,6 +1007,71 @@ mod tests { assert_eq!(result.warnings, 0); } + #[test] + fn test_non_loopback_host_without_api_key_warns() { + let config = server_with(|s| s.host = "0.0.0.0".to_owned()); + let mut result = empty_result(); + check_server_section(&config, &mut result); + assert!(result.warnings >= 1); + assert_eq!(result.failures, 0); + } + + #[test] + fn test_non_loopback_host_with_api_key_no_warning() { + let config = server_with(|s| { + s.host = "0.0.0.0".to_owned(); + s.api_key = Some("sk-test".to_owned()); + }); + let mut result = empty_result(); + check_server_section(&config, &mut result); + assert_eq!(result.warnings, 0); + assert_eq!(result.failures, 0); + } + + #[test] + fn test_cors_wildcard_on_loopback_passes() { + let config = server_with(|s| s.cors_origins = Some(vec!["*".to_owned()])); + let mut result = empty_result(); + check_server_section(&config, &mut result); + assert_eq!(result.warnings, 0); + assert_eq!(result.failures, 0); + } + + #[test] + fn test_cors_wildcard_on_network_host_warns() { + let config = server_with(|s| { + s.host = "0.0.0.0".to_owned(); + s.api_key = Some("sk-test".to_owned()); + s.cors_origins = Some(vec!["*".to_owned()]); + }); + let mut result = empty_result(); + check_server_section(&config, &mut result); + assert!(result.warnings >= 1); + assert_eq!(result.failures, 0); + } + + #[test] + fn test_cors_valid_origin_list_passes() { + let config = server_with(|s| { + s.cors_origins = Some(vec![ + "https://example.com".to_owned(), + "http://localhost:3000".to_owned(), + ]); + }); + let mut result = empty_result(); + check_server_section(&config, &mut result); + assert_eq!(result.warnings, 0); + assert_eq!(result.failures, 0); + } + + #[test] + fn test_cors_invalid_origin_fails() { + let config = server_with(|s| s.cors_origins = Some(vec!["not a url".to_owned()])); + let mut result = empty_result(); + check_server_section(&config, &mut result); + assert!(result.failures >= 1); + } + // -- Auto router -- #[test] diff --git a/crates/higgs/src/lib.rs b/crates/higgs/src/lib.rs index 7534c586..f49da3ad 100644 --- a/crates/higgs/src/lib.rs +++ b/crates/higgs/src/lib.rs @@ -30,14 +30,16 @@ use axum::{ Router, extract::DefaultBodyLimit, extract::{ConnectInfo, Request}, - http::StatusCode, + http::{HeaderValue, StatusCode}, middleware::{self, Next}, response::Response, routing::{get, post}, }; use governor::{Quota, RateLimiter, clock::DefaultClock, state::keyed::DefaultKeyedStateStore}; use tower_http::{ - cors::CorsLayer, timeout::TimeoutLayer, trace::TraceLayer, + cors::{Any, CorsLayer}, + timeout::TimeoutLayer, + trace::TraceLayer, validate_request::ValidateRequestHeaderLayer, }; @@ -59,6 +61,7 @@ pub fn build_router( api_key: Option, rate_limit: u32, max_body_size: usize, + cors_origins: Option>, ) -> Router { let timeout_duration = Duration::from_secs_f64(timeout_secs); @@ -93,16 +96,50 @@ pub fn build_router( api_routes = api_routes.layer(DefaultBodyLimit::max(max_body_size)); - Router::new() + let mut router = Router::new() .route("/health", get(routes::health::health)) .merge(api_routes) .layer(TraceLayer::new_for_http()) .layer(TimeoutLayer::with_status_code( StatusCode::GATEWAY_TIMEOUT, timeout_duration, - )) - .layer(CorsLayer::permissive()) - .with_state(state) + )); + + if let Some(cors) = build_cors_layer(cors_origins.as_deref()) { + router = router.layer(cors); + } + + router.with_state(state) +} + +/// Build a CORS layer from the configured origin allow-list. +/// +/// `None` (unset) sends no CORS headers; `["*"]` is fully permissive; +/// anything else is an explicit origin allow-list. +fn build_cors_layer(origins: Option<&[String]>) -> Option { + let origins = origins?; + if origins.iter().any(|o| o == "*") { + return Some(CorsLayer::permissive()); + } + let parsed: Vec = origins + .iter() + .filter_map(|origin| match origin.parse::() { + Ok(value) => Some(value), + Err(_) => { + tracing::warn!(origin = %origin, "ignoring invalid CORS origin"); + None + } + }) + .collect(); + if parsed.is_empty() { + return None; + } + Some( + CorsLayer::new() + .allow_origin(parsed) + .allow_methods(Any) + .allow_headers(Any), + ) } async fn rate_limit_middleware( @@ -120,3 +157,38 @@ async fn rate_limit_middleware( Err(_) => Err(StatusCode::TOO_MANY_REQUESTS), } } + +#[cfg(test)] +#[allow(clippy::panic, clippy::unwrap_used)] +mod cors_tests { + use super::build_cors_layer; + + #[test] + fn unset_origins_disable_cors() { + assert!(build_cors_layer(None).is_none()); + } + + #[test] + fn wildcard_enables_permissive_cors() { + let origins = vec!["*".to_owned()]; + assert!(build_cors_layer(Some(&origins)).is_some()); + } + + #[test] + fn explicit_origins_enable_cors() { + let origins = vec!["https://example.com".to_owned()]; + assert!(build_cors_layer(Some(&origins)).is_some()); + } + + #[test] + fn only_invalid_origins_disable_cors() { + let origins = vec!["\u{7f}invalid".to_owned()]; + assert!(build_cors_layer(Some(&origins)).is_none()); + } + + #[test] + fn empty_list_disables_cors() { + let origins: Vec = vec![]; + assert!(build_cors_layer(Some(&origins)).is_none()); + } +} diff --git a/crates/higgs/src/main.rs b/crates/higgs/src/main.rs index 6da7ed57..40d53405 100644 --- a/crates/higgs/src/main.rs +++ b/crates/higgs/src/main.rs @@ -70,22 +70,27 @@ async fn main() -> Result<(), Box> { } Commands::Doctor(ref args) => { init_tracing(cli.verbose); - let config = if let Some(ref path) = cli.config { - config::load_config_file(path, Some(args))? + let (config, config_path) = if let Some(ref path) = cli.config { + ( + config::load_config_file(path, Some(args))?, + Some(path.clone()), + ) } else if cli.profile.is_some() { let path = resolve_config_path(&cli)?; - config::load_config_file(&path, Some(args))? + let config = config::load_config_file(&path, Some(args))?; + (config, Some(path)) } else { let default = config::default_config_path(); if default.exists() { - config::load_config_file(&default, Some(args))? + let config = config::load_config_file(&default, Some(args))?; + (config, Some(default)) } else if !args.models.is_empty() { - config::build_simple_config(args)? + (config::build_simple_config(args)?, None) } else { return Err("no config to validate; use --config or 'higgs init'".into()); } }; - let result = higgs::doctor::run_doctor(&config).await; + let result = higgs::doctor::run_doctor(&config, config_path.as_deref()).await; if result.failures > 0 { std::process::exit(1); } @@ -196,6 +201,7 @@ async fn cmd_serve(cli: &Cli, args: &ServeArgs) -> Result<(), Box Result<(), Box String { + if text.len() <= max_bytes { + return text.to_owned(); + } + let mut end = max_bytes; + while !text.is_char_boundary(end) { + end -= 1; + } + let truncated = text.get(..end).unwrap_or_default(); + format!("{truncated}... (truncated, {} bytes total)", text.len()) +} + fn is_hop_by_hop(name: &http::header::HeaderName) -> bool { matches!( name.as_str(), @@ -181,6 +199,7 @@ pub async fn send_to_provider( .text() .await .unwrap_or_else(|_| String::from("(failed to read error body)")); + let error_body = truncate_error_body(&error_body, MAX_UPSTREAM_ERROR_BYTES); return Err(ServerError::ProxyError(format!( "upstream returned HTTP {status}: {error_body}" ))); @@ -262,6 +281,28 @@ pub fn extract_usage(body: &[u8]) -> (u64, u64) { mod tests { use super::*; + #[test] + fn short_error_body_unchanged() { + assert_eq!(truncate_error_body("bad request", 2048), "bad request"); + } + + #[test] + fn long_error_body_truncated_with_marker() { + let body = "x".repeat(5000); + let truncated = truncate_error_body(&body, 2048); + assert!(truncated.starts_with(&"x".repeat(2048))); + assert!(truncated.contains("truncated, 5000 bytes total")); + } + + #[test] + fn truncation_respects_utf8_boundaries() { + // 3-byte chars; a 4-byte cut would land mid-codepoint + let body = "錯".repeat(10); + let truncated = truncate_error_body(&body, 4); + assert!(truncated.starts_with('錯')); + assert!(truncated.contains("truncated")); + } + #[test] fn hop_by_hop_headers_filtered() { assert!(is_hop_by_hop(&http::header::HeaderName::from_static( diff --git a/crates/higgs/tests/integration/api_contract.rs b/crates/higgs/tests/integration/api_contract.rs index 41bd490c..d8716350 100644 --- a/crates/higgs/tests/integration/api_contract.rs +++ b/crates/higgs/tests/integration/api_contract.rs @@ -70,7 +70,7 @@ async fn metrics_endpoint_returns_snapshot_json() { error_body: None, }); - let app = build_router(build_test_state(Some(metrics)), 300.0, None, 0, 1024); + let app = build_router(build_test_state(Some(metrics)), 300.0, None, 0, 1024, None); let response = app .oneshot( Request::builder() @@ -93,7 +93,7 @@ async fn metrics_endpoint_returns_snapshot_json() { #[tokio::test] async fn request_body_limit_is_enforced() { - let app = build_router(build_test_state(None), 300.0, None, 0, 64); + let app = build_router(build_test_state(None), 300.0, None, 0, 64, None); let body = serde_json::json!({ "model": "gpt-4o", "messages": [{"role": "user", "content": "x".repeat(512)}] diff --git a/crates/higgs/tests/integration/proxy_e2e.rs b/crates/higgs/tests/integration/proxy_e2e.rs index 7c987476..b92b5c98 100644 --- a/crates/higgs/tests/integration/proxy_e2e.rs +++ b/crates/higgs/tests/integration/proxy_e2e.rs @@ -90,7 +90,7 @@ fn openai_chat_request_body() -> serde_json::Value { } fn build_app(state: Arc) -> axum::Router { - higgs::build_router(state, 300.0, None, 0, 10 * 1024 * 1024) + higgs::build_router(state, 300.0, None, 0, 10 * 1024 * 1024, None) } fn post_json(uri: &str, body: &serde_json::Value) -> Request { diff --git a/docs/configuration.md b/docs/configuration.md index bf50abbf..8644566b 100644 --- a/docs/configuration.md +++ b/docs/configuration.md @@ -14,7 +14,7 @@ This document collects the full CLI, environment, and config-file reference for | CLI Flag | Env Variable | Default | Description | |---|---|---|---| | `--model` | `HIGGS_MODELS` | *(required)* | Model path or HF ID (repeatable) | -| `--host` | `HIGGS_HOST` | `0.0.0.0` | Bind address | +| `--host` | `HIGGS_HOST` | `127.0.0.1` | Bind address (set `0.0.0.0` to expose on the network; pair with `--api-key`) | | `--port` | `HIGGS_PORT` | `8000` | Bind port | | `--max-tokens` | `HIGGS_MAX_TOKENS` | `32768` | Max generation tokens | | `--api-key` | `HIGGS_API_KEY` | *(none)* | Bearer token for auth | @@ -52,13 +52,18 @@ Run `higgs init` to create `~/.config/higgs/config.toml`: ```toml [server] -host = "0.0.0.0" +# Bind to loopback by default. To expose on the network, set host = "0.0.0.0" +# and set an api_key. +host = "127.0.0.1" port = 8000 # max_tokens = 32768 # timeout = 300.0 # max_body_size = 10485760 # api_key = "sk-..." # rate_limit = 0 +# CORS origin allow-list for browser clients. Unset = no CORS headers; +# ["*"] allows any origin. +# cors_origins = ["https://app.example.com"] # --- Local defaults --- [local] From a949e50fb3a5c2d650c2d7de367eb4afea444138 Mon Sep 17 00:00:00 2001 From: Claude Date: Thu, 11 Jun 2026 01:13:35 +0000 Subject: [PATCH 3/9] perf: incremental detokenization for streaming generation Both streaming engines re-decoded the entire completion token buffer on every generated token and rescanned the full text for stop sequences, making streaming O(n^2) in completion length. Replace with a shared IncrementalDetok that decodes only a bounded trailing token window per step and emits the diff; stop sequences are scanned over the new tail plus the maximum stop-sequence overlap only. UTF-8 sequences split across tokens are held back until complete (and flushed at finish), which also stops replacement-char corruption in streamed chunks. https://claude.ai/code/session_01Rko86UGb3GKp84joEs5E5c --- crates/higgs-engine/src/batch_engine.rs | 91 ++----- crates/higgs-engine/src/simple.rs | 326 ++++++++++++++++++++---- 2 files changed, 297 insertions(+), 120 deletions(-) diff --git a/crates/higgs-engine/src/batch_engine.rs b/crates/higgs-engine/src/batch_engine.rs index 8b3d640e..0fb68024 100644 --- a/crates/higgs-engine/src/batch_engine.rs +++ b/crates/higgs-engine/src/batch_engine.rs @@ -27,6 +27,7 @@ use crate::{ error::EngineError, model_loader, prompt_cache::PrefixCache, + simple::{IncrementalDetok, find_stop_in_tail}, }; /// Default maximum number of cached prefixes. @@ -63,7 +64,7 @@ struct ActiveRequest { constraint: Option, response_tx: tokio::sync::mpsc::Sender, prompt_len: u32, - prev_decoded_len: usize, + detok: IncrementalDetok, } // --------------------------------------------------------------------------- @@ -746,7 +747,7 @@ fn prefill_request( } // Send first token - let prev_decoded_len = first_text.len(); + let detok = IncrementalDetok::new(first_text.clone(), 1); if req .response_tx .blocking_send(StreamingOutput { @@ -773,7 +774,7 @@ fn prefill_request( constraint: req.constraint, response_tx: req.response_tx, prompt_len, - prev_decoded_len, + detok, })) }) } @@ -852,23 +853,21 @@ fn materialize_decode_step( let completion_len: u32 = ar.generated_tokens.len().try_into().unwrap_or(u32::MAX); - // Decode full text for diff and stop sequence checking - let full_text = tokenizer - .decode(&ar.generated_tokens, true) + // Decode only the trailing token window for diff and stop checking + let new_text = ar + .detok + .append(tokenizer, &ar.generated_tokens) .unwrap_or_default(); - let new_text = full_text - .get(ar.prev_decoded_len..) - .unwrap_or_default() - .to_owned(); - let old_decoded_len = ar.prev_decoded_len; - ar.prev_decoded_len = full_text.len(); - - let (final_new_text, hit_stop) = if ar.stop_sequences.is_empty() { + let emitted_before = ar.detok.text.len() - new_text.len(); + + let (mut final_new_text, hit_stop) = if ar.stop_sequences.is_empty() { (new_text, false) - } else if check_stop_sequences_simple(&full_text, &ar.stop_sequences) { - let truncated = truncate_at_stop(&full_text, &ar.stop_sequences); - let emit = truncated - .get(old_decoded_len..) + } else if let Some(pos) = find_stop_in_tail(&ar.detok.text, new_text.len(), &ar.stop_sequences) + { + let emit = ar + .detok + .text + .get(emitted_before..pos) .unwrap_or_default() .to_owned(); (emit, true) @@ -884,6 +883,13 @@ fn materialize_decode_step( .is_some_and(crate::constrained::ConstrainedGenerator::is_finished); let finished = is_eos || at_max || hit_stop || constraint_done; + if finished && !hit_stop { + final_new_text.push_str( + &ar.detok + .flush(tokenizer, &ar.generated_tokens) + .unwrap_or_default(), + ); + } let finish_reason = if is_eos || hit_stop || constraint_done { Some("stop".to_owned()) } else if at_max { @@ -912,20 +918,6 @@ fn check_stop_sequences_simple(text: &str, stop_sequences: &[String]) -> bool { stop_sequences.iter().any(|seq| text.contains(seq.as_str())) } -/// Truncate text at the earliest stop sequence. -fn truncate_at_stop(text: &str, stop_sequences: &[String]) -> String { - let mut earliest: Option = None; - for seq in stop_sequences { - if let Some(pos) = text.find(seq.as_str()) { - earliest = Some(earliest.map_or(pos, |prev| prev.min(pos))); - } - } - earliest.map_or_else( - || text.to_owned(), - |pos| text.get(..pos).unwrap_or_default().to_owned(), - ) -} - #[cfg(test)] #[allow(clippy::panic, clippy::unwrap_used, clippy::indexing_slicing)] mod tests { @@ -964,39 +956,6 @@ mod tests { assert!(check_stop_sequences_simple("text\n\nmore", &stops)); } - // ----------------------------------------------------------------------- - // truncate_at_stop - // ----------------------------------------------------------------------- - - #[test] - fn truncate_no_stop_returns_full_text() { - let stops = vec!["".to_owned()]; - assert_eq!(truncate_at_stop("hello world", &stops), "hello world"); - } - - #[test] - fn truncate_at_stop_removes_suffix() { - let stops = vec!["".to_owned()]; - assert_eq!(truncate_at_stop("hello", &stops), "hello"); - } - - #[test] - fn truncate_at_earliest_of_multiple_stops() { - let stops = vec!["BBB".to_owned(), "AAA".to_owned()]; - assert_eq!(truncate_at_stop("xAAAyBBBz", &stops), "x"); - } - - #[test] - fn truncate_empty_stops() { - assert_eq!(truncate_at_stop("hello", &[]), "hello"); - } - - #[test] - fn truncate_stop_at_start() { - let stops = vec!["STOP".to_owned()]; - assert_eq!(truncate_at_stop("STOPrest", &stops), ""); - } - // ----------------------------------------------------------------------- // materialize_decode_step // ----------------------------------------------------------------------- @@ -1017,7 +976,7 @@ mod tests { constraint: None, response_tx: tx, prompt_len: 5, - prev_decoded_len: 0, + detok: IncrementalDetok::new(String::new(), 0), }; (ar, rx) } diff --git a/crates/higgs-engine/src/simple.rs b/crates/higgs-engine/src/simple.rs index 9b37e73d..c22b4c60 100644 --- a/crates/higgs-engine/src/simple.rs +++ b/crates/higgs-engine/src/simple.rs @@ -1833,7 +1833,7 @@ impl SimpleEngine { tokens: &mut Vec, stop_sequences: &[String], sender: &tokio::sync::mpsc::Sender, - mut prev_decoded_len: usize, + mut detok: IncrementalDetok, enable_thinking: bool, ) -> Result<(), EngineError> { let has_stop_sequences = !stop_sequences.is_empty(); @@ -1956,36 +1956,25 @@ impl SimpleEngine { let completion_len = Self::completion_len(tokens)?; let is_max = completion_len >= max_tokens; - let full_text = self.decode_tokens(tokens)?; - let (final_new_text, hit_stop_seq) = if has_stop_sequences { - check_stop_sequences(&full_text, stop_sequences).map_or_else( - || { - ( - full_text - .get(prev_decoded_len..) - .unwrap_or_default() - .to_owned(), - false, - ) - }, - |truncated| { - let emit = truncated - .get(prev_decoded_len..) + let new_text = detok.append(&self.tokenizer, tokens)?; + let emitted_before = detok.text.len() - new_text.len(); + let (mut final_new_text, hit_stop_seq) = if has_stop_sequences { + find_stop_in_tail(&detok.text, new_text.len(), stop_sequences) + .map_or((new_text, false), |pos| { + let emit = detok + .text + .get(emitted_before..pos) .unwrap_or_default() .to_owned(); (emit, true) - }, - ) + }) } else { - ( - full_text - .get(prev_decoded_len..) - .unwrap_or_default() - .to_owned(), - false, - ) + (new_text, false) }; let step_finished = is_eos || is_max || hit_stop_seq; + if step_finished && !hit_stop_seq { + final_new_text.push_str(&detok.flush(&self.tokenizer, tokens)?); + } let finish_reason = if is_eos || hit_stop_seq { Some("stop".to_owned()) } else if is_max { @@ -1993,7 +1982,6 @@ impl SimpleEngine { } else { None }; - prev_decoded_len = full_text.len(); if step_finished { let elapsed = t_start.elapsed(); @@ -2034,20 +2022,16 @@ impl SimpleEngine { let completion_len = Self::completion_len(tokens)?; let is_max = completion_len >= max_tokens; - let full_text = self.decode_tokens(tokens)?; - let new_text = full_text - .get(prev_decoded_len..) - .unwrap_or_default() - .to_owned(); - let old_decoded_len = prev_decoded_len; - prev_decoded_len = full_text.len(); + let new_text = detok.append(&self.tokenizer, tokens)?; + let emitted_before = detok.text.len() - new_text.len(); - let (final_new_text, hit_stop_seq) = if has_stop_sequences { - check_stop_sequences(&full_text, stop_sequences).map_or( + let (mut final_new_text, hit_stop_seq) = if has_stop_sequences { + find_stop_in_tail(&detok.text, new_text.len(), stop_sequences).map_or( (new_text, false), - |truncated| { - let emit = truncated - .get(old_decoded_len..) + |pos| { + let emit = detok + .text + .get(emitted_before..pos) .unwrap_or_default() .to_owned(); (emit, true) @@ -2058,6 +2042,9 @@ impl SimpleEngine { }; let step_finished = is_eos || is_max || hit_stop_seq; + if step_finished && !hit_stop_seq { + final_new_text.push_str(&detok.flush(&self.tokenizer, tokens)?); + } let finish_reason = if is_eos || hit_stop_seq { Some("stop".to_owned()) } else if is_max { @@ -2236,7 +2223,7 @@ impl SimpleEngine { |truncated| (truncated, true), ) }; - let mut prev_decoded_len = first_decoded.len(); + let mut detok = IncrementalDetok::new(first_decoded, all_tokens.len()); let first_is_eos = self.eos_token_ids.contains(&first_token_id); let finished = first_is_eos || first_hit_stop || 1 >= max_tokens; @@ -2289,7 +2276,7 @@ impl SimpleEngine { &mut all_tokens, stop_sequences, sender, - prev_decoded_len, + detok, enable_thinking, ); } @@ -2387,22 +2374,18 @@ impl SimpleEngine { let completion_len = Self::completion_len(&all_tokens)?; - let full_text = self.decode_tokens(&all_tokens)?; - let new_text = full_text - .get(prev_decoded_len..) - .unwrap_or_default() - .to_owned(); - let old_decoded_len = prev_decoded_len; - prev_decoded_len = full_text.len(); + let new_text = detok.append(&self.tokenizer, &all_tokens)?; + let emitted_before = detok.text.len() - new_text.len(); - let (final_new_text, hit_stop_seq) = if stop_sequences.is_empty() { + let (mut final_new_text, hit_stop_seq) = if stop_sequences.is_empty() { (new_text, false) } else { - check_stop_sequences(&full_text, stop_sequences).map_or( + find_stop_in_tail(&detok.text, new_text.len(), stop_sequences).map_or( (new_text, false), - |truncated| { - let emit = truncated - .get(old_decoded_len..) + |pos| { + let emit = detok + .text + .get(emitted_before..pos) .unwrap_or_default() .to_owned(); (emit, true) @@ -2417,6 +2400,10 @@ impl SimpleEngine { .is_some_and(crate::constrained::ConstrainedGenerator::is_finished); let step_finished = is_eos || is_max || hit_stop_seq || constraint_done; + if step_finished && !hit_stop_seq { + final_new_text.push_str(&detok.flush(&self.tokenizer, &all_tokens)?); + } + let finish_reason = if is_eos || hit_stop_seq || constraint_done { Some("stop".to_owned()) } else if is_max { @@ -2460,6 +2447,134 @@ impl SimpleEngine { } } +/// Cap on the trailing token window re-decoded per streaming step. +/// Generous for any multi-token UTF-8 sequence; prevents a stream of +/// undecodable tokens from growing the window without bound. +const MAX_DETOK_WINDOW: usize = 64; + +/// Incremental streaming detokenizer. +/// +/// Re-decoding the full completion on every generated token is O(n^2) in +/// completion length. Instead, decode only `tokens[prefix_offset..]` and emit +/// the difference against `tokens[prefix_offset..read_offset]`; both windows +/// start at the same token, so tokenizer boundary effects cancel out. Text +/// ending in an incomplete UTF-8 sequence (trailing replacement char) is held +/// back until a later token completes it. +pub(crate) struct IncrementalDetok { + /// Start of the decode window; advanced whenever text is emitted. + prefix_offset: usize, + /// Number of leading tokens already represented in `text`. + read_offset: usize, + /// All text decoded so far (streamed to the client incrementally). + pub(crate) text: String, +} + +impl IncrementalDetok { + /// Start from text already decoded for the first `token_count` tokens. + pub(crate) const fn new(text: String, token_count: usize) -> Self { + Self { + prefix_offset: 0, + read_offset: token_count, + text, + } + } + + fn decode(tokenizer: &Tokenizer, tokens: &[u32]) -> Result { + tokenizer + .decode(tokens, true) + .map_err(|e| EngineError::Tokenization(e.to_string())) + } + + /// Decode the trailing window of `tokens`, appending newly stable text to + /// `self.text` and returning it. + pub(crate) fn append( + &mut self, + tokenizer: &Tokenizer, + tokens: &[u32], + ) -> Result { + let prefix_tokens = tokens + .get(self.prefix_offset..self.read_offset) + .unwrap_or_default(); + let window_tokens = tokens.get(self.prefix_offset..).unwrap_or_default(); + let prefix_text = Self::decode(tokenizer, prefix_tokens)?; + let window_text = Self::decode(tokenizer, window_tokens)?; + + let over_window = window_tokens.len() > MAX_DETOK_WINDOW; + if window_text.len() > prefix_text.len() + && (!window_text.ends_with('\u{FFFD}') || over_window) + { + let new_text = window_text + .get(prefix_text.len()..) + .unwrap_or_default() + .to_owned(); + self.prefix_offset = self.read_offset; + self.read_offset = tokens.len(); + self.text.push_str(&new_text); + return Ok(new_text); + } + if over_window && window_text.len() == prefix_text.len() { + // The pending tokens decode to nothing (e.g. skipped special + // tokens); drop them so the window stays bounded. + self.prefix_offset = tokens.len(); + self.read_offset = tokens.len(); + } + Ok(String::new()) + } + + /// Emit any text still held back by `append` (a trailing incomplete UTF-8 + /// sequence). Called when generation finishes so the total streamed text + /// matches a full decode of the token buffer. + pub(crate) fn flush( + &mut self, + tokenizer: &Tokenizer, + tokens: &[u32], + ) -> Result { + if self.read_offset >= tokens.len() { + return Ok(String::new()); + } + let prefix_tokens = tokens + .get(self.prefix_offset..self.read_offset) + .unwrap_or_default(); + let window_tokens = tokens.get(self.prefix_offset..).unwrap_or_default(); + let prefix_text = Self::decode(tokenizer, prefix_tokens)?; + let window_text = Self::decode(tokenizer, window_tokens)?; + let new_text = window_text + .get(prefix_text.len()..) + .unwrap_or_default() + .to_owned(); + self.prefix_offset = self.read_offset; + self.read_offset = tokens.len(); + self.text.push_str(&new_text); + Ok(new_text) + } +} + +/// Find the earliest stop-sequence occurrence that could involve the newly +/// appended text, scanning only the tail of `text` rather than the whole +/// buffer. Returns the absolute byte position where the match starts. +pub(crate) fn find_stop_in_tail( + text: &str, + new_len: usize, + stop_sequences: &[String], +) -> Option { + let max_stop = stop_sequences.iter().map(String::len).max().unwrap_or(0); + if max_stop == 0 { + return None; + } + let mut start = text.len().saturating_sub(new_len + max_stop - 1); + while !text.is_char_boundary(start) { + start -= 1; + } + let tail = text.get(start..)?; + let mut earliest: Option = None; + for seq in stop_sequences { + if let Some(pos) = tail.find(seq.as_str()) { + earliest = Some(earliest.map_or(pos, |prev| prev.min(pos))); + } + } + earliest.map(|pos| start + pos) +} + /// Check if any stop sequence appears in the generated text. /// Returns `Some(truncated_text)` if a stop sequence was found, None otherwise. fn check_stop_sequences(text: &str, stop_sequences: &[String]) -> Option { @@ -2587,8 +2702,8 @@ fn detect_thinking_support(model_dir: &Path) -> bool { #[allow(clippy::panic, clippy::unwrap_used)] mod tests { use super::{ - adaptive_draft_depth_for_cap, check_stop_sequences, derive_model_name, - estimate_paged_kv_blocks, parse_enabled_flag, + IncrementalDetok, Tokenizer, adaptive_draft_depth_for_cap, check_stop_sequences, + derive_model_name, estimate_paged_kv_blocks, find_stop_in_tail, parse_enabled_flag, }; use std::path::Path; @@ -2857,4 +2972,107 @@ mod tests { assert_eq!(parse_enabled_flag(Some("no")), Some(false)); assert_eq!(parse_enabled_flag(Some("unexpected")), None); } + + // ----------------------------------------------------------------------- + // find_stop_in_tail + // ----------------------------------------------------------------------- + + #[test] + fn find_stop_in_tail_empty_stops_returns_none() { + assert_eq!(find_stop_in_tail("hello world", 5, &[]), None); + } + + #[test] + fn find_stop_in_tail_finds_stop_in_new_text() { + let stops = vec!["".to_owned()]; + // "hello" with the last 4 bytes new + assert_eq!(find_stop_in_tail("hello", 4, &stops), Some(5)); + } + + #[test] + fn find_stop_in_tail_finds_stop_spanning_boundary() { + let stops = vec!["STOP".to_owned()]; + // "abcSTOP" where only "OP" is new; "ST" was emitted previously + assert_eq!(find_stop_in_tail("abcSTOP", 2, &stops), Some(3)); + } + + #[test] + fn find_stop_in_tail_ignores_stop_fully_in_old_text() { + let stops = vec!["XY".to_owned()]; + // "XYabcdefgh" with 2 new bytes: the scan window covers the last + // 2 + (2 - 1) = 3 bytes only, so the old "XY" is not rescanned. + assert_eq!(find_stop_in_tail("XYabcdefgh", 2, &stops), None); + } + + #[test] + fn find_stop_in_tail_earliest_of_multiple() { + let stops = vec!["BBB".to_owned(), "AAA".to_owned()]; + assert_eq!(find_stop_in_tail("xAAAyBBBz", 9, &stops), Some(1)); + } + + #[test] + fn find_stop_in_tail_handles_multibyte_boundary() { + let stops = vec!["端".to_owned()]; + // The tail start can land mid-codepoint; it must back up to a + // char boundary instead of panicking. "日本語端" = 4 chars, 12 bytes. + assert_eq!(find_stop_in_tail("日本語端", 3, &stops), Some(9)); + } + + // ----------------------------------------------------------------------- + // IncrementalDetok + // ----------------------------------------------------------------------- + + /// Minimal word-level tokenizer with a byte-level decoder for detok tests. + fn word_tokenizer() -> Tokenizer { + let json = r#"{ + "version": "1.0", + "truncation": null, + "padding": null, + "added_tokens": [], + "normalizer": null, + "pre_tokenizer": null, + "post_processor": null, + "decoder": { + "type": "ByteLevel", + "add_prefix_space": true, + "trim_offsets": true, + "use_regex": true + }, + "model": { + "type": "WordLevel", + "vocab": {"Hello": 0, "Ġworld": 1, "!": 2}, + "unk_token": "Hello" + } + }"#; + Tokenizer::from_bytes(json.as_bytes()).unwrap() + } + + #[test] + fn incremental_detok_emits_per_token_diffs() { + let tokenizer = word_tokenizer(); + let mut tokens: Vec = vec![0]; + let first = tokenizer.decode(&tokens, true).unwrap(); + let mut detok = IncrementalDetok::new(first.clone(), tokens.len()); + + tokens.push(1); + let second = detok.append(&tokenizer, &tokens).unwrap(); + tokens.push(2); + let third = detok.append(&tokenizer, &tokens).unwrap(); + + let full = tokenizer.decode(&tokens, true).unwrap(); + assert_eq!(format!("{first}{second}{third}"), full); + assert_eq!(detok.text, full); + } + + #[test] + fn incremental_detok_flush_without_pending_is_empty() { + let tokenizer = word_tokenizer(); + let mut tokens: Vec = vec![0]; + let first = tokenizer.decode(&tokens, true).unwrap(); + let mut detok = IncrementalDetok::new(first, tokens.len()); + + tokens.push(1); + detok.append(&tokenizer, &tokens).unwrap(); + assert_eq!(detok.flush(&tokenizer, &tokens).unwrap(), ""); + } } From 0d8e2594809589fe405a3b8091e669164b2e8e52 Mon Sep 17 00:00:00 2001 From: Claude Date: Thu, 11 Jun 2026 01:39:47 +0000 Subject: [PATCH 4/9] fix: address CI lint and test failures - doctor.rs: replace (mode & 0o077) == 0 with mode.trailing_zeros() >= 6 to fix clippy::verbose_bit_mask (denied by -D warnings) - proxy.rs: rename error_body to error_body_raw before truncation to fix clippy::shadow_reuse (denied lint) - lib.rs: rename build_cors_layer param origins -> origins_opt to fix clippy::shadow_reuse on the `let origins = origins_opt?` unwrap; rewrite filter_map match to Result::map_or_else to fix clippy::single_match_else and clippy::option_if_let_else (both denied) - tests/integration/router.rs:184: update build_router_is_public_and_callable type assertion to match new 6-parameter signature (cors_origins was added) https://claude.ai/code/session_01Rko86UGb3GKp84joEs5E5c --- crates/higgs/src/doctor.rs | 2 +- crates/higgs/src/lib.rs | 18 ++++++++++-------- crates/higgs/src/proxy.rs | 4 ++-- crates/higgs/tests/integration/router.rs | 3 ++- 4 files changed, 15 insertions(+), 12 deletions(-) diff --git a/crates/higgs/src/doctor.rs b/crates/higgs/src/doctor.rs index b570681e..7095ba73 100644 --- a/crates/higgs/src/doctor.rs +++ b/crates/higgs/src/doctor.rs @@ -224,7 +224,7 @@ fn check_config_file_permissions( let mode = metadata.permissions().mode() & 0o777; let has_secrets = config.server.api_key.is_some() || config.providers.values().any(|p| p.api_key.is_some()); - if (mode & 0o077) == 0 { + if mode.trailing_zeros() >= 6 { pass( &format!("config file permissions are owner-only ({mode:03o})"), result, diff --git a/crates/higgs/src/lib.rs b/crates/higgs/src/lib.rs index f49da3ad..a6d7a173 100644 --- a/crates/higgs/src/lib.rs +++ b/crates/higgs/src/lib.rs @@ -116,19 +116,21 @@ pub fn build_router( /// /// `None` (unset) sends no CORS headers; `["*"]` is fully permissive; /// anything else is an explicit origin allow-list. -fn build_cors_layer(origins: Option<&[String]>) -> Option { - let origins = origins?; +fn build_cors_layer(origins_opt: Option<&[String]>) -> Option { + let origins = origins_opt?; if origins.iter().any(|o| o == "*") { return Some(CorsLayer::permissive()); } let parsed: Vec = origins .iter() - .filter_map(|origin| match origin.parse::() { - Ok(value) => Some(value), - Err(_) => { - tracing::warn!(origin = %origin, "ignoring invalid CORS origin"); - None - } + .filter_map(|origin| { + origin.parse::().map_or_else( + |_| { + tracing::warn!(origin = %origin, "ignoring invalid CORS origin"); + None + }, + Some, + ) }) .collect(); if parsed.is_empty() { diff --git a/crates/higgs/src/proxy.rs b/crates/higgs/src/proxy.rs index 406f2d0a..3cfef9cd 100644 --- a/crates/higgs/src/proxy.rs +++ b/crates/higgs/src/proxy.rs @@ -195,11 +195,11 @@ pub async fn send_to_provider( let status = upstream.status().as_u16(); if status >= 400 { - let error_body = upstream + let error_body_raw = upstream .text() .await .unwrap_or_else(|_| String::from("(failed to read error body)")); - let error_body = truncate_error_body(&error_body, MAX_UPSTREAM_ERROR_BYTES); + let error_body = truncate_error_body(&error_body_raw, MAX_UPSTREAM_ERROR_BYTES); return Err(ServerError::ProxyError(format!( "upstream returned HTTP {status}: {error_body}" ))); diff --git a/crates/higgs/tests/integration/router.rs b/crates/higgs/tests/integration/router.rs index 988eb014..fdf8f54b 100644 --- a/crates/higgs/tests/integration/router.rs +++ b/crates/higgs/tests/integration/router.rs @@ -181,7 +181,8 @@ fn build_router_is_public_and_callable() { // This test verifies that build_router was successfully extracted to lib.rs. // We cannot actually call it without a real AppState, but we can verify // the function signature exists and is accessible. - let _: fn(Arc, f64, Option, u32, usize) -> axum::Router = build_router; + let _: fn(Arc, f64, Option, u32, usize, Option>) -> axum::Router = + build_router; } // --------------------------------------------------------------------------- From c1953125d8399f899236d7917f26b94051fccb4b Mon Sep 17 00:00:00 2001 From: Claude Date: Thu, 11 Jun 2026 01:43:38 +0000 Subject: [PATCH 5/9] fix: split long first doc paragraphs for clippy https://claude.ai/code/session_01Rko86UGb3GKp84joEs5E5c --- crates/higgs-engine/src/simple.rs | 7 ++++--- crates/higgs/src/config.rs | 9 +++++---- 2 files changed, 9 insertions(+), 7 deletions(-) diff --git a/crates/higgs-engine/src/simple.rs b/crates/higgs-engine/src/simple.rs index c22b4c60..1d9a0077 100644 --- a/crates/higgs-engine/src/simple.rs +++ b/crates/higgs-engine/src/simple.rs @@ -2549,9 +2549,10 @@ impl IncrementalDetok { } } -/// Find the earliest stop-sequence occurrence that could involve the newly -/// appended text, scanning only the tail of `text` rather than the whole -/// buffer. Returns the absolute byte position where the match starts. +/// Find the earliest stop-sequence occurrence that could involve the newly appended text. +/// +/// Scans only the tail of `text` rather than the whole buffer. +/// Returns the absolute byte position where the match starts. pub(crate) fn find_stop_in_tail( text: &str, new_len: usize, diff --git a/crates/higgs/src/config.rs b/crates/higgs/src/config.rs index 7d1fc7f3..43373c47 100644 --- a/crates/higgs/src/config.rs +++ b/crates/higgs/src/config.rs @@ -961,10 +961,11 @@ pub fn default_config_path() -> PathBuf { config_dir().join("config.toml") } -/// Write a file with owner-only permissions (0o600 on Unix). Used for config -/// files (which may contain provider API keys) and other daemon-private -/// files. The mode is applied at creation; existing files keep their -/// permissions (doctor warns about loose ones). +/// Write a file with owner-only permissions (0o600 on Unix). +/// +/// Used for config files (which may contain provider API keys) and other +/// daemon-private files. The mode is applied at creation; existing files +/// keep their permissions (doctor warns about loose ones). pub fn write_private_file(path: &Path, contents: &str) -> std::io::Result<()> { use std::io::Write as _; From d0845926a64c0179646855b628ef0da737e31b2f Mon Sep 17 00:00:00 2001 From: Claude Date: Thu, 11 Jun 2026 01:48:58 +0000 Subject: [PATCH 6/9] fix: type alias for build_router signature assertion MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Introduce `BuildRouterFn` type alias in the integration test to satisfy `clippy::type_complexity` (fatal under `-D warnings` with nursery group). No engine-crate changes were needed: audit of all new code added by this PR in `crates/higgs-engine/src/simple.rs`, `batch_engine.rs`, and `chat_template.rs` found no additional fatal lint violations — `use_self`, `option_if_let_else`, `redundant_clone`, `uninlined_format_args`, `items_after_statements`, `doc_markdown`, `missing_const_for_fn`, and `trivially_copy_pass_by_ref` were each reviewed and ruled out. https://claude.ai/code/session_01Rko86UGb3GKp84joEs5E5c --- crates/higgs/tests/integration/router.rs | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/crates/higgs/tests/integration/router.rs b/crates/higgs/tests/integration/router.rs index fdf8f54b..9e7b9e3e 100644 --- a/crates/higgs/tests/integration/router.rs +++ b/crates/higgs/tests/integration/router.rs @@ -176,13 +176,15 @@ async fn streaming_chat_returns_sse_content_type() { // build_router smoke test (verifies the function is callable and public) // --------------------------------------------------------------------------- +type BuildRouterFn = + fn(Arc, f64, Option, u32, usize, Option>) -> axum::Router; + #[test] fn build_router_is_public_and_callable() { // This test verifies that build_router was successfully extracted to lib.rs. // We cannot actually call it without a real AppState, but we can verify // the function signature exists and is accessible. - let _: fn(Arc, f64, Option, u32, usize, Option>) -> axum::Router = - build_router; + let _: BuildRouterFn = build_router; } // --------------------------------------------------------------------------- From 722218df3cfb10c8671721cf45888ec45e31187c Mon Sep 17 00:00:00 2001 From: Claude Date: Thu, 11 Jun 2026 02:40:04 +0000 Subject: [PATCH 7/9] fix: route first streamed token through incremental detokenizer MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Previously the first generated token was decoded with a raw tokenizer call, bypassing IncrementalDetok. This caused two bugs: 1. Partial UTF-8 sequences in the first token were never held back (they would appear as replacement characters instead of waiting for the next token to complete the codepoint). 2. When a stop sequence appeared mid-first-token, the prefix text before the stop was dropped entirely (batch_engine sent "" instead of the valid prefix). Site A (simple.rs generate_streaming_inner): initialize detok with empty state, run the first token through detok.append(), mirror the main loop's find_stop_in_tail stop check, and flush on EOS/max finish. Site B (batch_engine.rs prefill_request): same empty-init + append approach; fix the dropped-prefix bug by extracting the pre-stop slice from detok.text instead of sending ""; flush on non-stop finish. check_stop_sequences_simple (and its tests) removed — now unused. Also fix detach() PID file write in daemon.rs: replace fs::write with config::write_private_file so the PID file gets mode 0600 (private), consistent with write_pid_file() higher in the same file. New regression tests added to simple.rs: - incremental_detok_first_token_partial_utf8_held_back - incremental_detok_flush_emits_pending - find_stop_in_tail_first_token_prefix_and_stop https://claude.ai/code/session_01Rko86UGb3GKp84joEs5E5c --- crates/higgs-engine/src/batch_engine.rs | 71 ++++++++------------- crates/higgs-engine/src/simple.rs | 82 ++++++++++++++++++++++--- crates/higgs/src/daemon.rs | 2 +- 3 files changed, 101 insertions(+), 54 deletions(-) diff --git a/crates/higgs-engine/src/batch_engine.rs b/crates/higgs-engine/src/batch_engine.rs index 0fb68024..cef09d41 100644 --- a/crates/higgs-engine/src/batch_engine.rs +++ b/crates/higgs-engine/src/batch_engine.rs @@ -723,20 +723,40 @@ fn prefill_request( .as_ref() .map(|lp| lp.materialize(first_token_id)); - // Decode the first token for text diff tracking - let first_text = tokenizer - .decode(&[first_token_id], true) + // Decode the first token incrementally (routes through IncrementalDetok + // so partial UTF-8 is held back and prefix-before-stop is correctly emitted). + let mut detok = IncrementalDetok::new(String::new(), 0); + let first_chunk = detok + .append(tokenizer, &[first_token_id]) .unwrap_or_default(); + let emitted_before = detok.text.len() - first_chunk.len(); // Check if we're done after the first token let is_eos = eos_token_ids.contains(&first_token_id); - let hit_stop = check_stop_sequences_simple(&first_text, &req.stop_sequences); + let hit_stop = !req.stop_sequences.is_empty() + && find_stop_in_tail(&detok.text, first_chunk.len(), &req.stop_sequences).is_some(); let at_max = req.max_tokens <= 1; if is_eos || hit_stop || at_max { let finish_reason = if is_eos || hit_stop { "stop" } else { "length" }; + let mut send_text = if hit_stop { + // Emit any prefix text before the stop sequence + find_stop_in_tail(&detok.text, first_chunk.len(), &req.stop_sequences) + .and_then(|pos| detok.text.get(emitted_before..pos)) + .unwrap_or_default() + .to_owned() + } else { + first_chunk + }; + if !hit_stop { + send_text.push_str( + &detok + .flush(tokenizer, &[first_token_id]) + .unwrap_or_default(), + ); + } let _ = req.response_tx.blocking_send(StreamingOutput { - new_text: if hit_stop { String::new() } else { first_text }, + new_text: send_text, finished: true, finish_reason: Some(finish_reason.to_owned()), prompt_tokens: prompt_len, @@ -747,11 +767,10 @@ fn prefill_request( } // Send first token - let detok = IncrementalDetok::new(first_text.clone(), 1); if req .response_tx .blocking_send(StreamingOutput { - new_text: first_text, + new_text: first_chunk, finished: false, finish_reason: None, prompt_tokens: prompt_len, @@ -913,49 +932,11 @@ fn materialize_decode_step( finished || disconnected } -/// Check if any stop sequence appears in the text. -fn check_stop_sequences_simple(text: &str, stop_sequences: &[String]) -> bool { - stop_sequences.iter().any(|seq| text.contains(seq.as_str())) -} - #[cfg(test)] #[allow(clippy::panic, clippy::unwrap_used, clippy::indexing_slicing)] mod tests { use super::*; - // ----------------------------------------------------------------------- - // check_stop_sequences_simple - // ----------------------------------------------------------------------- - - #[test] - fn stop_sequences_empty_never_matches() { - assert!(!check_stop_sequences_simple("hello world", &[])); - } - - #[test] - fn stop_sequences_match_at_end() { - let stops = vec!["".to_owned()]; - assert!(check_stop_sequences_simple("some text", &stops)); - } - - #[test] - fn stop_sequences_match_in_middle() { - let stops = vec!["STOP".to_owned()]; - assert!(check_stop_sequences_simple("before STOP after", &stops)); - } - - #[test] - fn stop_sequences_no_match() { - let stops = vec!["".to_owned(), "<|end|>".to_owned()]; - assert!(!check_stop_sequences_simple("normal text", &stops)); - } - - #[test] - fn stop_sequences_multiple_one_matches() { - let stops = vec!["".to_owned(), "\n\n".to_owned()]; - assert!(check_stop_sequences_simple("text\n\nmore", &stops)); - } - // ----------------------------------------------------------------------- // materialize_decode_step // ----------------------------------------------------------------------- diff --git a/crates/higgs-engine/src/simple.rs b/crates/higgs-engine/src/simple.rs index 1d9a0077..3533790f 100644 --- a/crates/higgs-engine/src/simple.rs +++ b/crates/higgs-engine/src/simple.rs @@ -2214,20 +2214,32 @@ impl SimpleEngine { } all_tokens.push(first_token_id); - let first_decoded = self.decode_tokens(&all_tokens)?; - let (first_text, first_hit_stop) = if stop_sequences.is_empty() { - (first_decoded.clone(), false) + let mut detok = IncrementalDetok::new(String::new(), 0); + let new_text = detok.append(&self.tokenizer, &all_tokens)?; + let emitted_before = detok.text.len() - new_text.len(); + let (mut first_text, first_hit_stop) = if stop_sequences.is_empty() { + (new_text, false) } else { - check_stop_sequences(&first_decoded, stop_sequences).map_or_else( - || (first_decoded.clone(), false), - |truncated| (truncated, true), + find_stop_in_tail(&detok.text, new_text.len(), stop_sequences).map_or( + (new_text, false), + |pos| { + let emit = detok + .text + .get(emitted_before..pos) + .unwrap_or_default() + .to_owned(); + (emit, true) + }, ) }; - let mut detok = IncrementalDetok::new(first_decoded, all_tokens.len()); let first_is_eos = self.eos_token_ids.contains(&first_token_id); let finished = first_is_eos || first_hit_stop || 1 >= max_tokens; + if finished && !first_hit_stop { + first_text.push_str(&detok.flush(&self.tokenizer, &all_tokens)?); + } + let first_logprob = first_logprob_data .as_ref() .map(|lp| lp.materialize(first_token_id)); @@ -3041,7 +3053,7 @@ mod tests { }, "model": { "type": "WordLevel", - "vocab": {"Hello": 0, "Ġworld": 1, "!": 2}, + "vocab": {"Hello": 0, "Ġworld": 1, "!": 2, "ðŁĺ": 3, "Ģ": 4}, "unk_token": "Hello" } }"#; @@ -3076,4 +3088,58 @@ mod tests { detok.append(&tokenizer, &tokens).unwrap(); assert_eq!(detok.flush(&tokenizer, &tokens).unwrap(), ""); } + + /// Tokens 3 and 4 are the byte-level pieces of 😀 (U+1F600). + /// Appending only token 3 must hold back the incomplete UTF-8 sequence + /// (return ""), and appending both tokens must emit the full emoji. + #[test] + fn incremental_detok_first_token_partial_utf8_held_back() { + let tokenizer = word_tokenizer(); + let mut detok = IncrementalDetok::new(String::new(), 0); + + // First partial piece: held back, no replacement char emitted + let held = detok.append(&tokenizer, &[3]).unwrap(); + assert_eq!(held, "", "partial UTF-8 token must be held back"); + + // Completing the sequence emits the full emoji + let emitted = detok.append(&tokenizer, &[3, 4]).unwrap(); + assert_eq!(emitted, "😀", "completing UTF-8 should emit the emoji"); + + let full = tokenizer.decode(&[3u32, 4], true).unwrap(); + assert_eq!(detok.text, full, "detok.text must equal full decode"); + } + + /// flush() must emit any bytes held back by append(), and a second flush + /// on the same (now-fully-drained) detok must return "". + #[test] + fn incremental_detok_flush_emits_pending() { + let tokenizer = word_tokenizer(); + let mut detok = IncrementalDetok::new(String::new(), 0); + + // Partial piece held back + let held = detok.append(&tokenizer, &[3]).unwrap(); + assert_eq!(held, "", "partial UTF-8 must be held back before flush"); + + // flush must emit something (a replacement char is acceptable here) + let flushed = detok.flush(&tokenizer, &[3]).unwrap(); + assert!(!flushed.is_empty(), "flush must emit held-back text"); + + // A second flush on an already-drained detok returns "" + let flushed2 = detok.flush(&tokenizer, &[3]).unwrap(); + assert_eq!(flushed2, "", "second flush must return empty string"); + } + + /// find_stop_in_tail must identify a stop whose prefix was already emitted, + /// returning the byte position that lets the caller emit the prefix ("hi") + /// before the stop ("STOP") in the same new-text window. + #[test] + fn find_stop_in_tail_first_token_prefix_and_stop() { + let stops = vec!["STOP".to_owned()]; + // "hiSTOP": all 6 bytes are new, stop starts at byte 2 + assert_eq!( + find_stop_in_tail("hiSTOP", 6, &stops), + Some(2), + "stop at pos 2 so 'hi' can be emitted as prefix" + ); + } } diff --git a/crates/higgs/src/daemon.rs b/crates/higgs/src/daemon.rs index f980c50e..3322abeb 100644 --- a/crates/higgs/src/daemon.rs +++ b/crates/higgs/src/daemon.rs @@ -541,7 +541,7 @@ pub fn detach(config_path: &Path, verbose: bool, profile: Option<&str>) { let _ = child.wait(); }); - if let Err(e) = fs::write(pid_path(profile), child_pid.to_string()) { + if let Err(e) = config::write_private_file(&pid_path(profile), &child_pid.to_string()) { eprintln!("failed to write pid file: {e}"); std::process::exit(1); } From deefbe39345669e101b9f1fc5caa90de0aaaa973 Mon Sep 17 00:00:00 2001 From: Claude Date: Thu, 11 Jun 2026 02:42:51 +0000 Subject: [PATCH 8/9] fix: doc_markdown lints in detok test comments https://claude.ai/code/session_01Rko86UGb3GKp84joEs5E5c --- crates/higgs-engine/src/simple.rs | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/crates/higgs-engine/src/simple.rs b/crates/higgs-engine/src/simple.rs index 3533790f..cd1b8025 100644 --- a/crates/higgs-engine/src/simple.rs +++ b/crates/higgs-engine/src/simple.rs @@ -3089,9 +3089,9 @@ mod tests { assert_eq!(detok.flush(&tokenizer, &tokens).unwrap(), ""); } - /// Tokens 3 and 4 are the byte-level pieces of 😀 (U+1F600). - /// Appending only token 3 must hold back the incomplete UTF-8 sequence - /// (return ""), and appending both tokens must emit the full emoji. + // Tokens 3 and 4 are the byte-level pieces of 😀 (U+1F600). + // Appending only token 3 must hold back the incomplete UTF-8 sequence + // (return ""), and appending both tokens must emit the full emoji. #[test] fn incremental_detok_first_token_partial_utf8_held_back() { let tokenizer = word_tokenizer(); @@ -3109,8 +3109,8 @@ mod tests { assert_eq!(detok.text, full, "detok.text must equal full decode"); } - /// flush() must emit any bytes held back by append(), and a second flush - /// on the same (now-fully-drained) detok must return "". + // flush() must emit any bytes held back by append(), and a second flush + // on the same (now-fully-drained) detok must return "". #[test] fn incremental_detok_flush_emits_pending() { let tokenizer = word_tokenizer(); @@ -3129,9 +3129,9 @@ mod tests { assert_eq!(flushed2, "", "second flush must return empty string"); } - /// find_stop_in_tail must identify a stop whose prefix was already emitted, - /// returning the byte position that lets the caller emit the prefix ("hi") - /// before the stop ("STOP") in the same new-text window. + // find_stop_in_tail must identify a stop whose prefix was already emitted, + // returning the byte position that lets the caller emit the prefix ("hi") + // before the stop ("STOP") in the same new-text window. #[test] fn find_stop_in_tail_first_token_prefix_and_stop() { let stops = vec!["STOP".to_owned()]; From 8d16696c2d59750d3d742f203ab7316a4785b2be Mon Sep 17 00:00:00 2001 From: Claude Date: Thu, 11 Jun 2026 02:45:49 +0000 Subject: [PATCH 9/9] fix: rename first-token detok bindings to avoid shadowing MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit In `generate_streaming_inner` (simple.rs), the first-token block used `new_text` and `emitted_before` as binding names, which were then reused in the main decode loop — triggering `-D clippy::shadow_unrelated`. Rename the first-token block's bindings (lines ~2218-2234) to `first_new_text` and `first_emitted_before` throughout that section (`detok.append` result, `detok.text.len()` subtraction, both arms of the `find_stop_in_tail` map_or, and the range in `.get()`). The main loop's bindings at lines 2389-2390 are unchanged. No logic change; no other shadowing candidates found in commits 722218d or deefbe3 (batch_engine.rs `prefill_request` binds `first_chunk`/`emitted_before` once each; `materialize_decode_step` uses `new_text`/`emitted_before` in a separate function scope). https://claude.ai/code/session_01Rko86UGb3GKp84joEs5E5c --- crates/higgs-engine/src/simple.rs | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/crates/higgs-engine/src/simple.rs b/crates/higgs-engine/src/simple.rs index cd1b8025..ac011529 100644 --- a/crates/higgs-engine/src/simple.rs +++ b/crates/higgs-engine/src/simple.rs @@ -2215,17 +2215,17 @@ impl SimpleEngine { all_tokens.push(first_token_id); let mut detok = IncrementalDetok::new(String::new(), 0); - let new_text = detok.append(&self.tokenizer, &all_tokens)?; - let emitted_before = detok.text.len() - new_text.len(); + let first_new_text = detok.append(&self.tokenizer, &all_tokens)?; + let first_emitted_before = detok.text.len() - first_new_text.len(); let (mut first_text, first_hit_stop) = if stop_sequences.is_empty() { - (new_text, false) + (first_new_text, false) } else { - find_stop_in_tail(&detok.text, new_text.len(), stop_sequences).map_or( - (new_text, false), + find_stop_in_tail(&detok.text, first_new_text.len(), stop_sequences).map_or( + (first_new_text, false), |pos| { let emit = detok .text - .get(emitted_before..pos) + .get(first_emitted_before..pos) .unwrap_or_default() .to_owned(); (emit, true)