From c955f57f95295d0feffb203f2fcf6a4e13eb5635 Mon Sep 17 00:00:00 2001
From: Claude <noreply@anthropic.com>
Date: Thu, 11 Jun 2026 01:13:18 +0000
Subject: [PATCH 1/9] docs: add June 2026 codebase review with prioritized
 findings

Covers security posture, performance hot paths, correctness, cleanup,
test gaps, and an assessment of Apple's Core AI framework relative to
the MLX-based architecture.

https://claude.ai/code/session_01Rko86UGb3GKp84joEs5E5c
---
 docs/codebase-review-2026-06.md | 177 ++++++++++++++++++++++++++++++++
 1 file changed, 177 insertions(+)
 create mode 100644 docs/codebase-review-2026-06.md

diff --git a/docs/codebase-review-2026-06.md b/docs/codebase-review-2026-06.md
new file mode 100644
index 00000000..0e01e561
--- /dev/null
+++ b/docs/codebase-review-2026-06.md
@@ -0,0 +1,177 @@
+# Codebase Review — June 2026
+
+A full review of the workspace (`higgs`, `higgs-engine`, `higgs-models`, `higgs-bench`)
+covering security, performance, correctness, cleanup, and an assessment of Apple's
+Core AI framework. File/line references are against the tree at the time of review;
+items marked **[fixed]** were addressed on the branch that introduced this document.
+
+## Overall assessment
+
+The codebase is in good shape. The strict lint policy in `Cargo.toml` is genuinely
+enforced — production code is free of `unwrap`/`panic` (all occurrences are
+`#[cfg(test)]`-guarded), custom Metal kernels are cached correctly in `OnceLock`
+statics, README and docs match actual behavior, and CI covers fmt, clippy
+(pedantic + nursery), tests, MSRV, and a 70% coverage floor.
+
+The significant findings cluster in three areas:
+
+1. An O(n²) streaming decode path in the engine.
+2. Insecure-by-default network posture and world-readable secrets.
+3. Duplication: two prefix-cache implementations, duplicated prefill logic, and
+   near-duplicate route handlers.
+
+## Apple Core AI assessment
+
+[Core AI](https://developer.apple.com/documentation/coreai/) is Apple's WWDC 2026
+on-device inference framework — the runtime behind Apple Intelligence, now public.
+It targets CPU, GPU, and the **Apple Neural Engine (ANE)**, with ahead-of-time model
+compilation for instant load, stateful (KV-cache-style) execution, zero-copy data
+paths, and PyTorch conversion tooling.
+
+**It is Swift-only today.** There is no documented C/C++ API surface, and therefore
+no path to call it from Rust without a hand-rolled Swift shim. Recommendation:
+
+- **Keep MLX (mlx-rs) as the backbone.** It is the right tool for a Rust inference
+  server on Apple Silicon, and higgs already exploits it well (custom Metal kernels
+  for GatedDeltaNet and TurboQuant, `mlx_rs::fast::*` SDPA/RoPE paths).
+- **Future opportunity — ANE offload for auxiliary models.** A small Swift
+  shim/XPC sidecar could run compact models on the ANE while the GPU stays
+  dedicated to the main model. Best candidates: speculative-decode draft models
+  (`higgs-engine/src/mtp.rs`), embedding models, and the SigLIP vision encoder
+  (`higgs-models/src/siglip.rs`). This is R&D, not a near-term refactor.
+- **Cold start.** Core AI's AOT-compiled model assets could eventually reduce model
+  load times if a bridge materializes.
+- **Watch for** a C API or community Rust bindings before committing to anything.
+
+## P0 — Security
+
+1. **Insecure network defaults.** The server defaulted to `0.0.0.0`
+   (`crates/higgs/src/config.rs`, `default_host()`), with no API key required and
+   `CorsLayer::permissive()` unconditionally applied (`crates/higgs/src/lib.rs`).
+   Any host on the LAN could use the server. **[fixed]** Default bind is now
+   `127.0.0.1`, CORS is configurable via `server.cors_origins` (disabled unless
+   set; `"*"` opts into permissive), and `higgs doctor` warns when binding a
+   non-loopback address without an API key.
+2. **World-readable secrets.** Config files containing provider API keys were
+   written with default permissions (0o644) by `higgs init`
+   (`crates/higgs/src/daemon.rs`) and `higgs config set`
+   (`crates/higgs/src/cli_config.rs`). **[fixed]** Config files are now written
+   0o600 on Unix, and doctor warns when an existing config containing an
+   `api_key` is group/world-readable.
+3. **Upstream error bodies leaked into errors/metrics.**
+   `crates/higgs/src/proxy.rs` embedded the entire upstream error body in
+   `ProxyError`, which flows into logs and the metrics store; upstream bodies can
+   echo request headers/keys. **[fixed]** Bodies are truncated to a bounded length.
+4. **Chat template execution surface.** `higgs-engine/src/chat_template.rs`
+   registers all of `minijinja_contrib` plus the pycompat method callback for
+   templates loaded from model directories (`chat_template.jinja` /
+   `tokenizer_config.json`). A malicious model repo therefore gets a fairly rich
+   template language. minijinja has no filesystem/process access so this is
+   bounded, but unbounded loops are possible. **[fixed]** Template execution is
+   now fuel-limited via `Environment::set_fuel`. Follow-up (not done): trim the
+   contrib filter set to what HF templates actually use.
+5. **Missing resource bounds** (recommendation, not implemented):
+   - No configurable prompt-length limit before tokenization/prefill.
+   - Constrained decoding compiles schema-derived structures without a size cap
+     (`higgs-engine/src/constrained.rs`); a pathological JSON schema is a DoS vector.
+   - The engine `sessions` map (`higgs-engine/src/simple.rs`) has no cap or TTL.
+   Suggested: `max_prompt_tokens` and `max_schema_bytes` config fields plus a
+   session cap, all validated by doctor.
+
+## P1 — Performance
+
+6. **O(n²) streaming decode.** Each generated token triggered
+   `decode_tokens(tokens)` over the *entire* completion so far, then sliced off
+   the new suffix; with stop sequences enabled, `check_stop_sequences` also
+   rescanned the full text every step (`higgs-engine/src/simple.rs`,
+   streaming loop). Cost grows quadratically with completion length — measurable
+   on long generations. The batch engine had the identical pattern in
+   `batch_engine.rs`. **[fixed]** Both engines now share an incremental
+   detokenizer that decodes a bounded trailing window of tokens per step
+   (UTF-8 sequences split across tokens are held back until complete, which
+   also fixes replacement-char corruption in streamed output), and stop
+   sequences are scanned over only the new tail plus the maximum
+   stop-sequence overlap. The non-streaming path still re-decodes per token
+   when stop sequences are configured — same recipe applies if it shows up
+   in profiles.
+7. **Prefix cache clones — corrected, no change needed.** An initial pass
+   flagged `PrefixCache::find_longest_prefix` for cloning the entire cached
+   `AnyCache` per lookup (`higgs-engine/src/prompt_cache.rs`). Inspection of
+   mlx-rs shows `Array::clone` is a refcounted handle copy (`mlx_array_set`),
+   so an `AnyCache` clone is O(layers) handle bumps, not a tensor copy — and
+   consumers genuinely need an owned copy because the forward pass mutates it.
+   The remaining open question (macOS-only to verify): a cached handle keeps
+   the underlying buffers alive, which can prevent MLX buffer donation and
+   force copy-on-write during subsequent decode steps. Worth profiling before
+   any restructuring.
+8. **O(n) LRU eviction.** Eviction walks the whole radix tree to find the oldest
+   entry (`prompt_cache.rs::evict_lru`; similar in `paged_prefix_cache.rs`).
+   Acceptable at current cache sizes now that per-lookup clones are gone; revisit
+   with an access-ordered index if `max_entries` grows.
+9. **TurboQuant dtype conversions — corrected, no change needed.** The
+   unconditional `as_dtype(Dtype::Float32)` calls in
+   `higgs-models/src/turboquant.rs` looked like redundant materializations,
+   but MLX core's `astype` short-circuits to a same-handle return when the
+   dtype already matches (`mlx/ops.cpp: if (dtype == a.dtype()) return a;`),
+   so these are no-ops beyond an FFI call.
+10. Lower priority (not implemented): per-request `messages.clone()` in route
+    handlers (`crates/higgs/src/routes/chat.rs`); `Mutex` rather than `RwLock`
+    for read-heavy engine state (`simple.rs`); metrics store uses two `RwLock`s
+    per request (fine at local-server request rates).
+
+## P2 — Correctness
+
+11. **Two prefix-cache implementations.** `prompt_cache.rs` (~670 LOC,
+    clone-based radix tree) and `paged_prefix_cache.rs` (~1,070 LOC, block-paged)
+    coexist; the simple engine uses the paged one while the batch engine uses the
+    old one. Converge on the paged cache and delete the old implementation.
+12. **Client disconnect cleanup.** The batch engine only notices a disconnected
+    client when `blocking_send` fails (`batch_engine.rs`), leaving in-flight
+    request state to be cleaned up late; on the HTTP side, a stream that ends
+    early can leave the metrics record without final token counts
+    (`routes/chat.rs` streaming finalization). Add cancellation-aware cleanup and
+    mark such records as cancelled.
+13. **Error swallowing.** `cache/paged.rs::remove_session` logs only the first
+    block-free error; weight loading warns (rather than errors) on unmatched
+    keys (`higgs-models/src/lib.rs`), which is lenient enough to hide a corrupt
+    or mismatched checkpoint. Consider a strict-loading flag.
+14. **PID-file TOCTTOU.** `read_pid → pid_is_alive → kill` in `daemon.rs` is
+    non-atomic. Low impact (stale-PID cleanup exists); noted for completeness.
+15. **Unimplemented session APIs.** `step()`/`generate_session()` in `simple.rs`
+    return "not implemented" — either finish batched session generation or remove
+    the scaffolding so the API surface matches reality.
+
+## P3 — Cleanup / maintainability
+
+- **Duplicated prefill/decode-graph logic** between `simple.rs` and
+  `batch_engine.rs` (~100 LOC overlap each for prefill and decode-graph
+  construction). Extract shared helpers.
+- **`generate_inner()` is ~1,100 lines** (`simple.rs`) covering standard decode,
+  MTP, prompt-lookup, and thinking-budget paths. Split per strategy.
+- **Route handler duplication.** `routes/chat.rs` and `routes/anthropic.rs`
+  share the parse → route → stream/non-stream → metrics skeleton; several
+  handlers run 300-950 lines. A shared streaming-response helper would remove
+  most of it.
+- **`qwen3_next.rs` is ~15k lines** — inline Metal kernel codegen, GDN layers,
+  MoE, and tests in one file. Split into submodules.
+- **Attention/mask logic duplicated** across `gemma2.rs`, `phi3.rs`,
+  `starcoder2.rs`, `deepseek_v2.rs` despite shared helpers existing in
+  `utils.rs` (which `transformer.rs` already uses for Llama/Qwen2/Mistral).
+- **Hardcoded `THINKING_BUDGET = 256`** (`simple.rs`) — should be a config field
+  (with doctor validation and README/init-template updates per project rules).
+- **mlx-rs git pin** blocks crates.io publishing (release workflow skips publish
+  when git-pinned). Track upstream releases.
+- **Ignored test due to global MLX state** (`higgs-models/src/yarn.rs`) — known
+  Metal/RNG state contamination across tests in one process; documented, but
+  worth revisiting if more tests start interfering.
+
+## Test gaps
+
+- Streaming SSE end-to-end over HTTP (chunk framing, `[DONE]`).
+- Client-disconnect/cancellation paths (engine cleanup, metrics finalization).
+- Cache exhaustion: prefix-cache eviction under pressure, paged-cache block
+  exhaustion, multi-session contention.
+- Adversarial inputs: hostile chat templates, pathological constrained-decoding
+  schemas, very long prompts.
+- Session lifecycle (create → generate → remove) and leak detection.
+- Concurrent-request stress (rate limiter, lock contention).

From 1b41a5c93696a7492ef4bf7e941982341edf50ab Mon Sep 17 00:00:00 2001
From: Claude <noreply@anthropic.com>
Date: Thu, 11 Jun 2026 01:13:27 +0000
Subject: [PATCH 2/9] feat: harden default security posture

- Bind 127.0.0.1 by default instead of 0.0.0.0; doctor warns when a
  non-loopback bind has no api_key set
- Make CORS opt-in via server.cors_origins (unset = no CORS headers,
  ["*"] = permissive, otherwise an explicit origin allow-list) instead
  of unconditionally permissive; doctor validates entries
- Write config and PID files with 0o600 permissions; doctor warns when
  an existing config containing API keys is group/world-accessible
- Truncate upstream error bodies before embedding them in proxy errors
  and metrics, so upstream responses cannot leak echoed credentials
- Fuel-limit chat template execution (templates load from third-party
  model directories)

https://claude.ai/code/session_01Rko86UGb3GKp84joEs5E5c
---
 Cargo.toml                                    |   2 +-
 README.md                                     |   2 +
 crates/higgs-engine/src/chat_template.rs      |   9 +
 crates/higgs/src/cli_config.rs                |   2 +-
 crates/higgs/src/config.rs                    |  39 +++-
 crates/higgs/src/daemon.rs                    |  10 +-
 crates/higgs/src/doctor.rs                    | 178 +++++++++++++++++-
 crates/higgs/src/lib.rs                       |  84 ++++++++-
 crates/higgs/src/main.rs                      |  19 +-
 crates/higgs/src/proxy.rs                     |  41 ++++
 .../higgs/tests/integration/api_contract.rs   |   4 +-
 crates/higgs/tests/integration/proxy_e2e.rs   |   2 +-
 docs/configuration.md                         |   9 +-
 13 files changed, 376 insertions(+), 25 deletions(-)

diff --git a/Cargo.toml b/Cargo.toml
index d64fa01f..4e883fda 100644
--- a/Cargo.toml
+++ b/Cargo.toml
@@ -96,7 +96,7 @@ json5 = "1"
 
 # Tokenization + Templates
 tokenizers = { version = "0.23", features = ["http"] }
-minijinja = { version = "2", features = ["loader"] }
+minijinja = { version = "2", features = ["loader", "fuel"] }
 minijinja-contrib = { version = "2", features = ["pycompat"] }
 
 # CLI + Configuration
diff --git a/README.md b/README.md
index f919a833..cb896738 100644
--- a/README.md
+++ b/README.md
@@ -27,6 +27,8 @@ Higgs is a single static Rust binary that serves local models, proxies to provid
 - Exact local model names now beat regex routes.
 - `/metrics` is a real endpoint, and `server.max_body_size` is enforced on API requests.
 - `higgs shellenv` and `higgs exec` now fail fast on bad config or an unreachable server.
+- The server now binds `127.0.0.1` by default (was `0.0.0.0`). Set `server.host = "0.0.0.0"` (and an `api_key`) to expose it on the network.
+- CORS headers are no longer sent unless `server.cors_origins` is set (`["*"]` restores the old permissive behavior).
 
 ## Quick Links
 
diff --git a/crates/higgs-engine/src/chat_template.rs b/crates/higgs-engine/src/chat_template.rs
index b3f27b23..655ed055 100644
--- a/crates/higgs-engine/src/chat_template.rs
+++ b/crates/higgs-engine/src/chat_template.rs
@@ -12,6 +12,11 @@ pub struct ChatMessage {
     pub tool_calls: Option<Vec<serde_json::Value>>,
 }
 
+/// Upper bound on template-engine instructions per render. Generous enough
+/// for complex HF chat templates over long conversations, but stops a
+/// malicious template from looping forever.
+const TEMPLATE_FUEL: u64 = 5_000_000;
+
 /// Renders chat messages using a Jinja2 template (`HuggingFace` format).
 pub struct ChatTemplateRenderer {
     env: Environment<'static>,
@@ -24,6 +29,10 @@ impl ChatTemplateRenderer {
     /// Create a renderer from a Jinja2 template string.
     pub fn new<S: Into<String>>(template_source: S) -> Result<Self, EngineError> {
         let mut env = Environment::new();
+        // Templates come from model directories (tokenizer_config.json /
+        // chat_template.jinja), which are third-party content; bound execution
+        // so a hostile template cannot loop forever.
+        env.set_fuel(Some(TEMPLATE_FUEL));
         env.add_filter("tojson", tojson_filter);
         minijinja_contrib::add_to_environment(&mut env);
         env.set_unknown_method_callback(minijinja_contrib::pycompat::unknown_method_callback);
diff --git a/crates/higgs/src/cli_config.rs b/crates/higgs/src/cli_config.rs
index 362146a6..f3283ce2 100644
--- a/crates/higgs/src/cli_config.rs
+++ b/crates/higgs/src/cli_config.rs
@@ -71,7 +71,7 @@ pub fn config_set(config_path: &Path, key: &str, value: &str) {
             std::process::exit(1);
         });
     }
-    fs::write(config_path, &rendered).unwrap_or_else(|e| {
+    config::write_private_file(config_path, &rendered).unwrap_or_else(|e| {
         eprintln!("failed to write {}: {e}", config_path.display());
         std::process::exit(1);
     });
diff --git a/crates/higgs/src/config.rs b/crates/higgs/src/config.rs
index b9967aa4..7d1fc7f3 100644
--- a/crates/higgs/src/config.rs
+++ b/crates/higgs/src/config.rs
@@ -273,6 +273,9 @@ pub struct ServerSection {
     pub timeout: f64,
     #[serde(default = "default_max_body_size")]
     pub max_body_size: usize,
+    /// CORS allow-list of origins. Unset = no CORS headers are sent;
+    /// `["*"]` allows any origin (permissive).
+    pub cors_origins: Option<Vec<String>>,
 }
 
 impl Default for ServerSection {
@@ -285,12 +288,13 @@ impl Default for ServerSection {
             rate_limit: 0,
             timeout: default_timeout(),
             max_body_size: default_max_body_size(),
+            cors_origins: None,
         }
     }
 }
 
 fn default_host() -> String {
-    "0.0.0.0".to_owned()
+    "127.0.0.1".to_owned()
 }
 
 const fn default_port() -> u16 {
@@ -957,6 +961,24 @@ pub fn default_config_path() -> PathBuf {
     config_dir().join("config.toml")
 }
 
+/// Write a file with owner-only permissions (0o600 on Unix). Used for config
+/// files (which may contain provider API keys) and other daemon-private
+/// files. The mode is applied at creation; existing files keep their
+/// permissions (doctor warns about loose ones).
+pub fn write_private_file(path: &Path, contents: &str) -> std::io::Result<()> {
+    use std::io::Write as _;
+
+    let mut options = std::fs::OpenOptions::new();
+    options.write(true).create(true).truncate(true);
+    #[cfg(unix)]
+    {
+        use std::os::unix::fs::OpenOptionsExt as _;
+        options.mode(0o600);
+    }
+    let mut file = options.open(path)?;
+    file.write_all(contents.as_bytes())
+}
+
 /// Validates that a profile name is safe for use in file paths.
 pub fn validate_profile_name(name: &str) -> Result<(), String> {
     if name.is_empty() {
@@ -1057,7 +1079,7 @@ mod tests {
         assert!(config.models.is_empty());
         assert!(config.providers.is_empty());
         assert!(config.routes.is_empty());
-        assert_eq!(config.server.host, "0.0.0.0");
+        assert_eq!(config.server.host, "127.0.0.1");
         assert_eq!(config.server.port, 8000);
         assert_eq!(config.server.max_tokens, 32768);
         assert!((config.server.timeout - 300.0).abs() < f64::EPSILON);
@@ -1071,6 +1093,19 @@ mod tests {
         assert_eq!(config.default.provider, "higgs");
     }
 
+    #[test]
+    #[cfg(unix)]
+    fn test_write_private_file_owner_only_permissions() {
+        use std::os::unix::fs::PermissionsExt as _;
+        let dir = std::env::temp_dir().join(format!("higgs-cfg-test-{}", std::process::id()));
+        std::fs::create_dir_all(&dir).unwrap();
+        let path = dir.join("config.toml");
+        write_private_file(&path, "[server]\n").unwrap();
+        let mode = std::fs::metadata(&path).unwrap().permissions().mode() & 0o777;
+        assert_eq!(mode, 0o600);
+        let _ = std::fs::remove_dir_all(&dir);
+    }
+
     #[test]
     fn test_simple_mode_builds_models() {
         let args = ServeArgs {
diff --git a/crates/higgs/src/daemon.rs b/crates/higgs/src/daemon.rs
index 244c4a63..f980c50e 100644
--- a/crates/higgs/src/daemon.rs
+++ b/crates/higgs/src/daemon.rs
@@ -47,7 +47,7 @@ pub fn remove_pid_file(profile: Option<&str>) {
 
 pub fn write_pid_file(profile: Option<&str>) {
     let pid = std::process::id();
-    if let Err(e) = fs::write(pid_path(profile), pid.to_string()) {
+    if let Err(e) = config::write_private_file(&pid_path(profile), &pid.to_string()) {
         tracing::warn!("failed to write pid file: {e}");
     }
 }
@@ -128,12 +128,16 @@ pub fn cmd_init(profile: Option<&str>) {
 
     let default_config = format!(
         r#"[server]
-host = "0.0.0.0"
+# Bind to loopback by default. To expose on the network, set host = "0.0.0.0"
+# and set an api_key.
+host = "127.0.0.1"
 port = 8000
 # max_tokens = 32768
 # timeout = 300.0
 # api_key = "sk-..."
 # rate_limit = 0
+# CORS origin allow-list for browser clients. Unset = no CORS headers.
+# cors_origins = ["*"]
 
 # --- Local serving defaults ---
 # MLX profile applies to simple-engine local models. "auto" picks balanced for
@@ -216,7 +220,7 @@ provider = "higgs"
 "#
     );
 
-    if let Err(e) = fs::write(&path, default_config) {
+    if let Err(e) = config::write_private_file(&path, &default_config) {
         eprintln!("failed to write {}: {e}", path.display());
         std::process::exit(1);
     }
diff --git a/crates/higgs/src/doctor.rs b/crates/higgs/src/doctor.rs
index e66f8c40..b570681e 100644
--- a/crates/higgs/src/doctor.rs
+++ b/crates/higgs/src/doctor.rs
@@ -31,7 +31,10 @@ fn fail(msg: &str, result: &mut DoctorResult) {
 }
 
 #[allow(clippy::print_stderr)]
-pub async fn run_doctor(config: &HiggsConfig) -> DoctorResult {
+pub async fn run_doctor(
+    config: &HiggsConfig,
+    config_path: Option<&std::path::Path>,
+) -> DoctorResult {
     let mut result = DoctorResult {
         passes: 0,
         warnings: 0,
@@ -41,6 +44,7 @@ pub async fn run_doctor(config: &HiggsConfig) -> DoctorResult {
     eprintln!("\x1b[1mhiggs doctor\x1b[0m\n");
 
     check_config_valid(&mut result);
+    check_config_file_permissions(config, config_path, &mut result);
     check_server_section(config, &mut result);
     check_models(config, &mut result);
     check_duplicate_models(config, &mut result);
@@ -134,6 +138,23 @@ fn check_server_section(config: &crate::config::HiggsConfig, result: &mut Doctor
         );
     }
 
+    let non_loopback = matches!(
+        server.host.parse::<std::net::IpAddr>(),
+        Ok(ip) if !ip.is_loopback()
+    );
+    if non_loopback && server.api_key.is_none() {
+        warn(
+            &format!(
+                "server.host=\"{}\" is reachable from the network but server.api_key is unset; \
+                 anyone on the network can use this server",
+                server.host
+            ),
+            result,
+        );
+    }
+
+    check_cors_origins(server, non_loopback, result);
+
     if server.rate_limit == 0 {
         pass("server.rate_limit=0 (disabled)", result);
     } else {
@@ -144,6 +165,96 @@ fn check_server_section(config: &crate::config::HiggsConfig, result: &mut Doctor
     }
 }
 
+fn check_cors_origins(
+    server: &crate::config::ServerSection,
+    non_loopback: bool,
+    result: &mut DoctorResult,
+) {
+    match &server.cors_origins {
+        None => pass("server.cors_origins unset; no CORS headers sent", result),
+        Some(origins) if origins.iter().any(|o| o == "*") => {
+            if non_loopback {
+                warn(
+                    "server.cors_origins allows any origin (\"*\") on a network-reachable host; \
+                     consider an explicit origin list",
+                    result,
+                );
+            } else {
+                pass("server.cors_origins=[\"*\"] (permissive)", result);
+            }
+        }
+        Some(origins) => {
+            let mut all_valid = true;
+            for origin in origins {
+                let parses = origin.parse::<http::HeaderValue>().is_ok();
+                if !parses || !(origin.starts_with("http://") || origin.starts_with("https://")) {
+                    fail(
+                        &format!(
+                            "server.cors_origins entry \"{origin}\" is not a valid origin \
+                             (expected e.g. \"https://example.com\")"
+                        ),
+                        result,
+                    );
+                    all_valid = false;
+                }
+            }
+            if all_valid {
+                pass(
+                    &format!("server.cors_origins lists {} origin(s)", origins.len()),
+                    result,
+                );
+            }
+        }
+    }
+}
+
+/// Warn when the config file holding API keys is readable by other users.
+#[cfg(unix)]
+fn check_config_file_permissions(
+    config: &HiggsConfig,
+    config_path: Option<&std::path::Path>,
+    result: &mut DoctorResult,
+) {
+    use std::os::unix::fs::PermissionsExt as _;
+
+    let Some(path) = config_path else { return };
+    let Ok(metadata) = std::fs::metadata(path) else {
+        return;
+    };
+    let mode = metadata.permissions().mode() & 0o777;
+    let has_secrets =
+        config.server.api_key.is_some() || config.providers.values().any(|p| p.api_key.is_some());
+    if (mode & 0o077) == 0 {
+        pass(
+            &format!("config file permissions are owner-only ({mode:03o})"),
+            result,
+        );
+    } else if has_secrets {
+        warn(
+            &format!(
+                "config file {} is group/world-accessible (mode {mode:03o}) and contains API \
+                 keys; run: chmod 600 {}",
+                path.display(),
+                path.display()
+            ),
+            result,
+        );
+    } else {
+        pass(
+            &format!("config file permissions {mode:03o} (no API keys present)"),
+            result,
+        );
+    }
+}
+
+#[cfg(not(unix))]
+fn check_config_file_permissions(
+    _config: &HiggsConfig,
+    _config_path: Option<&std::path::Path>,
+    _result: &mut DoctorResult,
+) {
+}
+
 fn model_label(model: &crate::config::ModelConfig) -> String {
     model.name.as_ref().map_or_else(
         || model.path.clone(),
@@ -896,6 +1007,71 @@ mod tests {
         assert_eq!(result.warnings, 0);
     }
 
+    #[test]
+    fn test_non_loopback_host_without_api_key_warns() {
+        let config = server_with(|s| s.host = "0.0.0.0".to_owned());
+        let mut result = empty_result();
+        check_server_section(&config, &mut result);
+        assert!(result.warnings >= 1);
+        assert_eq!(result.failures, 0);
+    }
+
+    #[test]
+    fn test_non_loopback_host_with_api_key_no_warning() {
+        let config = server_with(|s| {
+            s.host = "0.0.0.0".to_owned();
+            s.api_key = Some("sk-test".to_owned());
+        });
+        let mut result = empty_result();
+        check_server_section(&config, &mut result);
+        assert_eq!(result.warnings, 0);
+        assert_eq!(result.failures, 0);
+    }
+
+    #[test]
+    fn test_cors_wildcard_on_loopback_passes() {
+        let config = server_with(|s| s.cors_origins = Some(vec!["*".to_owned()]));
+        let mut result = empty_result();
+        check_server_section(&config, &mut result);
+        assert_eq!(result.warnings, 0);
+        assert_eq!(result.failures, 0);
+    }
+
+    #[test]
+    fn test_cors_wildcard_on_network_host_warns() {
+        let config = server_with(|s| {
+            s.host = "0.0.0.0".to_owned();
+            s.api_key = Some("sk-test".to_owned());
+            s.cors_origins = Some(vec!["*".to_owned()]);
+        });
+        let mut result = empty_result();
+        check_server_section(&config, &mut result);
+        assert!(result.warnings >= 1);
+        assert_eq!(result.failures, 0);
+    }
+
+    #[test]
+    fn test_cors_valid_origin_list_passes() {
+        let config = server_with(|s| {
+            s.cors_origins = Some(vec![
+                "https://example.com".to_owned(),
+                "http://localhost:3000".to_owned(),
+            ]);
+        });
+        let mut result = empty_result();
+        check_server_section(&config, &mut result);
+        assert_eq!(result.warnings, 0);
+        assert_eq!(result.failures, 0);
+    }
+
+    #[test]
+    fn test_cors_invalid_origin_fails() {
+        let config = server_with(|s| s.cors_origins = Some(vec!["not a url".to_owned()]));
+        let mut result = empty_result();
+        check_server_section(&config, &mut result);
+        assert!(result.failures >= 1);
+    }
+
     // -- Auto router --
 
     #[test]
diff --git a/crates/higgs/src/lib.rs b/crates/higgs/src/lib.rs
index 7534c586..f49da3ad 100644
--- a/crates/higgs/src/lib.rs
+++ b/crates/higgs/src/lib.rs
@@ -30,14 +30,16 @@ use axum::{
     Router,
     extract::DefaultBodyLimit,
     extract::{ConnectInfo, Request},
-    http::StatusCode,
+    http::{HeaderValue, StatusCode},
     middleware::{self, Next},
     response::Response,
     routing::{get, post},
 };
 use governor::{Quota, RateLimiter, clock::DefaultClock, state::keyed::DefaultKeyedStateStore};
 use tower_http::{
-    cors::CorsLayer, timeout::TimeoutLayer, trace::TraceLayer,
+    cors::{Any, CorsLayer},
+    timeout::TimeoutLayer,
+    trace::TraceLayer,
     validate_request::ValidateRequestHeaderLayer,
 };
 
@@ -59,6 +61,7 @@ pub fn build_router(
     api_key: Option<String>,
     rate_limit: u32,
     max_body_size: usize,
+    cors_origins: Option<Vec<String>>,
 ) -> Router {
     let timeout_duration = Duration::from_secs_f64(timeout_secs);
 
@@ -93,16 +96,50 @@ pub fn build_router(
 
     api_routes = api_routes.layer(DefaultBodyLimit::max(max_body_size));
 
-    Router::new()
+    let mut router = Router::new()
         .route("/health", get(routes::health::health))
         .merge(api_routes)
         .layer(TraceLayer::new_for_http())
         .layer(TimeoutLayer::with_status_code(
             StatusCode::GATEWAY_TIMEOUT,
             timeout_duration,
-        ))
-        .layer(CorsLayer::permissive())
-        .with_state(state)
+        ));
+
+    if let Some(cors) = build_cors_layer(cors_origins.as_deref()) {
+        router = router.layer(cors);
+    }
+
+    router.with_state(state)
+}
+
+/// Build a CORS layer from the configured origin allow-list.
+///
+/// `None` (unset) sends no CORS headers; `["*"]` is fully permissive;
+/// anything else is an explicit origin allow-list.
+fn build_cors_layer(origins: Option<&[String]>) -> Option<CorsLayer> {
+    let origins = origins?;
+    if origins.iter().any(|o| o == "*") {
+        return Some(CorsLayer::permissive());
+    }
+    let parsed: Vec<HeaderValue> = origins
+        .iter()
+        .filter_map(|origin| match origin.parse::<HeaderValue>() {
+            Ok(value) => Some(value),
+            Err(_) => {
+                tracing::warn!(origin = %origin, "ignoring invalid CORS origin");
+                None
+            }
+        })
+        .collect();
+    if parsed.is_empty() {
+        return None;
+    }
+    Some(
+        CorsLayer::new()
+            .allow_origin(parsed)
+            .allow_methods(Any)
+            .allow_headers(Any),
+    )
 }
 
 async fn rate_limit_middleware(
@@ -120,3 +157,38 @@ async fn rate_limit_middleware(
         Err(_) => Err(StatusCode::TOO_MANY_REQUESTS),
     }
 }
+
+#[cfg(test)]
+#[allow(clippy::panic, clippy::unwrap_used)]
+mod cors_tests {
+    use super::build_cors_layer;
+
+    #[test]
+    fn unset_origins_disable_cors() {
+        assert!(build_cors_layer(None).is_none());
+    }
+
+    #[test]
+    fn wildcard_enables_permissive_cors() {
+        let origins = vec!["*".to_owned()];
+        assert!(build_cors_layer(Some(&origins)).is_some());
+    }
+
+    #[test]
+    fn explicit_origins_enable_cors() {
+        let origins = vec!["https://example.com".to_owned()];
+        assert!(build_cors_layer(Some(&origins)).is_some());
+    }
+
+    #[test]
+    fn only_invalid_origins_disable_cors() {
+        let origins = vec!["\u{7f}invalid".to_owned()];
+        assert!(build_cors_layer(Some(&origins)).is_none());
+    }
+
+    #[test]
+    fn empty_list_disables_cors() {
+        let origins: Vec<String> = vec![];
+        assert!(build_cors_layer(Some(&origins)).is_none());
+    }
+}
diff --git a/crates/higgs/src/main.rs b/crates/higgs/src/main.rs
index 6da7ed57..40d53405 100644
--- a/crates/higgs/src/main.rs
+++ b/crates/higgs/src/main.rs
@@ -70,22 +70,27 @@ async fn main() -> Result<(), Box<dyn std::error::Error>> {
         }
         Commands::Doctor(ref args) => {
             init_tracing(cli.verbose);
-            let config = if let Some(ref path) = cli.config {
-                config::load_config_file(path, Some(args))?
+            let (config, config_path) = if let Some(ref path) = cli.config {
+                (
+                    config::load_config_file(path, Some(args))?,
+                    Some(path.clone()),
+                )
             } else if cli.profile.is_some() {
                 let path = resolve_config_path(&cli)?;
-                config::load_config_file(&path, Some(args))?
+                let config = config::load_config_file(&path, Some(args))?;
+                (config, Some(path))
             } else {
                 let default = config::default_config_path();
                 if default.exists() {
-                    config::load_config_file(&default, Some(args))?
+                    let config = config::load_config_file(&default, Some(args))?;
+                    (config, Some(default))
                 } else if !args.models.is_empty() {
-                    config::build_simple_config(args)?
+                    (config::build_simple_config(args)?, None)
                 } else {
                     return Err("no config to validate; use --config or 'higgs init'".into());
                 }
             };
-            let result = higgs::doctor::run_doctor(&config).await;
+            let result = higgs::doctor::run_doctor(&config, config_path.as_deref()).await;
             if result.failures > 0 {
                 std::process::exit(1);
             }
@@ -196,6 +201,7 @@ async fn cmd_serve(cli: &Cli, args: &ServeArgs) -> Result<(), Box<dyn std::error
     let api_key = higgs_config.server.api_key.clone();
     let rate_limit = higgs_config.server.rate_limit;
     let max_body_size = higgs_config.server.max_body_size;
+    let cors_origins = higgs_config.server.cors_origins.clone();
     let bind_addr = format!("{}:{}", higgs_config.server.host, higgs_config.server.port);
 
     // Create metrics (config mode only)
@@ -223,6 +229,7 @@ async fn cmd_serve(cli: &Cli, args: &ServeArgs) -> Result<(), Box<dyn std::error
         api_key,
         rate_limit,
         max_body_size,
+        cors_origins,
     );
 
     // Start server
diff --git a/crates/higgs/src/proxy.rs b/crates/higgs/src/proxy.rs
index 8f6ed5a2..406f2d0a 100644
--- a/crates/higgs/src/proxy.rs
+++ b/crates/higgs/src/proxy.rs
@@ -8,6 +8,24 @@ use futures::TryStreamExt;
 
 use crate::error::ServerError;
 
+/// Maximum number of upstream error-body bytes embedded in errors/metrics.
+/// Upstream error bodies can echo request headers (including credentials),
+/// so they are truncated rather than stored verbatim.
+const MAX_UPSTREAM_ERROR_BYTES: usize = 2048;
+
+/// Truncate `text` to at most `max_bytes`, respecting UTF-8 char boundaries.
+fn truncate_error_body(text: &str, max_bytes: usize) -> String {
+    if text.len() <= max_bytes {
+        return text.to_owned();
+    }
+    let mut end = max_bytes;
+    while !text.is_char_boundary(end) {
+        end -= 1;
+    }
+    let truncated = text.get(..end).unwrap_or_default();
+    format!("{truncated}... (truncated, {} bytes total)", text.len())
+}
+
 fn is_hop_by_hop(name: &http::header::HeaderName) -> bool {
     matches!(
         name.as_str(),
@@ -181,6 +199,7 @@ pub async fn send_to_provider(
             .text()
             .await
             .unwrap_or_else(|_| String::from("(failed to read error body)"));
+        let error_body = truncate_error_body(&error_body, MAX_UPSTREAM_ERROR_BYTES);
         return Err(ServerError::ProxyError(format!(
             "upstream returned HTTP {status}: {error_body}"
         )));
@@ -262,6 +281,28 @@ pub fn extract_usage(body: &[u8]) -> (u64, u64) {
 mod tests {
     use super::*;
 
+    #[test]
+    fn short_error_body_unchanged() {
+        assert_eq!(truncate_error_body("bad request", 2048), "bad request");
+    }
+
+    #[test]
+    fn long_error_body_truncated_with_marker() {
+        let body = "x".repeat(5000);
+        let truncated = truncate_error_body(&body, 2048);
+        assert!(truncated.starts_with(&"x".repeat(2048)));
+        assert!(truncated.contains("truncated, 5000 bytes total"));
+    }
+
+    #[test]
+    fn truncation_respects_utf8_boundaries() {
+        // 3-byte chars; a 4-byte cut would land mid-codepoint
+        let body = "錯".repeat(10);
+        let truncated = truncate_error_body(&body, 4);
+        assert!(truncated.starts_with('錯'));
+        assert!(truncated.contains("truncated"));
+    }
+
     #[test]
     fn hop_by_hop_headers_filtered() {
         assert!(is_hop_by_hop(&http::header::HeaderName::from_static(
diff --git a/crates/higgs/tests/integration/api_contract.rs b/crates/higgs/tests/integration/api_contract.rs
index 41bd490c..d8716350 100644
--- a/crates/higgs/tests/integration/api_contract.rs
+++ b/crates/higgs/tests/integration/api_contract.rs
@@ -70,7 +70,7 @@ async fn metrics_endpoint_returns_snapshot_json() {
         error_body: None,
     });
 
-    let app = build_router(build_test_state(Some(metrics)), 300.0, None, 0, 1024);
+    let app = build_router(build_test_state(Some(metrics)), 300.0, None, 0, 1024, None);
     let response = app
         .oneshot(
             Request::builder()
@@ -93,7 +93,7 @@ async fn metrics_endpoint_returns_snapshot_json() {
 
 #[tokio::test]
 async fn request_body_limit_is_enforced() {
-    let app = build_router(build_test_state(None), 300.0, None, 0, 64);
+    let app = build_router(build_test_state(None), 300.0, None, 0, 64, None);
     let body = serde_json::json!({
         "model": "gpt-4o",
         "messages": [{"role": "user", "content": "x".repeat(512)}]
diff --git a/crates/higgs/tests/integration/proxy_e2e.rs b/crates/higgs/tests/integration/proxy_e2e.rs
index 7c987476..b92b5c98 100644
--- a/crates/higgs/tests/integration/proxy_e2e.rs
+++ b/crates/higgs/tests/integration/proxy_e2e.rs
@@ -90,7 +90,7 @@ fn openai_chat_request_body() -> serde_json::Value {
 }
 
 fn build_app(state: Arc<AppState>) -> axum::Router {
-    higgs::build_router(state, 300.0, None, 0, 10 * 1024 * 1024)
+    higgs::build_router(state, 300.0, None, 0, 10 * 1024 * 1024, None)
 }
 
 fn post_json(uri: &str, body: &serde_json::Value) -> Request<Body> {
diff --git a/docs/configuration.md b/docs/configuration.md
index bf50abbf..8644566b 100644
--- a/docs/configuration.md
+++ b/docs/configuration.md
@@ -14,7 +14,7 @@ This document collects the full CLI, environment, and config-file reference for
 | CLI Flag | Env Variable | Default | Description |
 |---|---|---|---|
 | `--model` | `HIGGS_MODELS` | *(required)* | Model path or HF ID (repeatable) |
-| `--host` | `HIGGS_HOST` | `0.0.0.0` | Bind address |
+| `--host` | `HIGGS_HOST` | `127.0.0.1` | Bind address (set `0.0.0.0` to expose on the network; pair with `--api-key`) |
 | `--port` | `HIGGS_PORT` | `8000` | Bind port |
 | `--max-tokens` | `HIGGS_MAX_TOKENS` | `32768` | Max generation tokens |
 | `--api-key` | `HIGGS_API_KEY` | *(none)* | Bearer token for auth |
@@ -52,13 +52,18 @@ Run `higgs init` to create `~/.config/higgs/config.toml`:
 
 ```toml
 [server]
-host = "0.0.0.0"
+# Bind to loopback by default. To expose on the network, set host = "0.0.0.0"
+# and set an api_key.
+host = "127.0.0.1"
 port = 8000
 # max_tokens = 32768
 # timeout = 300.0
 # max_body_size = 10485760
 # api_key = "sk-..."
 # rate_limit = 0
+# CORS origin allow-list for browser clients. Unset = no CORS headers;
+# ["*"] allows any origin.
+# cors_origins = ["https://app.example.com"]
 
 # --- Local defaults ---
 [local]

From a949e50fb3a5c2d650c2d7de367eb4afea444138 Mon Sep 17 00:00:00 2001
From: Claude <noreply@anthropic.com>
Date: Thu, 11 Jun 2026 01:13:35 +0000
Subject: [PATCH 3/9] perf: incremental detokenization for streaming generation

Both streaming engines re-decoded the entire completion token buffer on
every generated token and rescanned the full text for stop sequences,
making streaming O(n^2) in completion length.

Replace with a shared IncrementalDetok that decodes only a bounded
trailing token window per step and emits the diff; stop sequences are
scanned over the new tail plus the maximum stop-sequence overlap only.
UTF-8 sequences split across tokens are held back until complete (and
flushed at finish), which also stops replacement-char corruption in
streamed chunks.

https://claude.ai/code/session_01Rko86UGb3GKp84joEs5E5c
---
 crates/higgs-engine/src/batch_engine.rs |  91 ++-----
 crates/higgs-engine/src/simple.rs       | 326 ++++++++++++++++++++----
 2 files changed, 297 insertions(+), 120 deletions(-)

diff --git a/crates/higgs-engine/src/batch_engine.rs b/crates/higgs-engine/src/batch_engine.rs
index 8b3d640e..0fb68024 100644
--- a/crates/higgs-engine/src/batch_engine.rs
+++ b/crates/higgs-engine/src/batch_engine.rs
@@ -27,6 +27,7 @@ use crate::{
     error::EngineError,
     model_loader,
     prompt_cache::PrefixCache,
+    simple::{IncrementalDetok, find_stop_in_tail},
 };
 
 /// Default maximum number of cached prefixes.
@@ -63,7 +64,7 @@ struct ActiveRequest {
     constraint: Option<crate::constrained::ConstrainedGenerator>,
     response_tx: tokio::sync::mpsc::Sender<StreamingOutput>,
     prompt_len: u32,
-    prev_decoded_len: usize,
+    detok: IncrementalDetok,
 }
 
 // ---------------------------------------------------------------------------
@@ -746,7 +747,7 @@ fn prefill_request(
         }
 
         // Send first token
-        let prev_decoded_len = first_text.len();
+        let detok = IncrementalDetok::new(first_text.clone(), 1);
         if req
             .response_tx
             .blocking_send(StreamingOutput {
@@ -773,7 +774,7 @@ fn prefill_request(
             constraint: req.constraint,
             response_tx: req.response_tx,
             prompt_len,
-            prev_decoded_len,
+            detok,
         }))
     })
 }
@@ -852,23 +853,21 @@ fn materialize_decode_step(
 
     let completion_len: u32 = ar.generated_tokens.len().try_into().unwrap_or(u32::MAX);
 
-    // Decode full text for diff and stop sequence checking
-    let full_text = tokenizer
-        .decode(&ar.generated_tokens, true)
+    // Decode only the trailing token window for diff and stop checking
+    let new_text = ar
+        .detok
+        .append(tokenizer, &ar.generated_tokens)
         .unwrap_or_default();
-    let new_text = full_text
-        .get(ar.prev_decoded_len..)
-        .unwrap_or_default()
-        .to_owned();
-    let old_decoded_len = ar.prev_decoded_len;
-    ar.prev_decoded_len = full_text.len();
-
-    let (final_new_text, hit_stop) = if ar.stop_sequences.is_empty() {
+    let emitted_before = ar.detok.text.len() - new_text.len();
+
+    let (mut final_new_text, hit_stop) = if ar.stop_sequences.is_empty() {
         (new_text, false)
-    } else if check_stop_sequences_simple(&full_text, &ar.stop_sequences) {
-        let truncated = truncate_at_stop(&full_text, &ar.stop_sequences);
-        let emit = truncated
-            .get(old_decoded_len..)
+    } else if let Some(pos) = find_stop_in_tail(&ar.detok.text, new_text.len(), &ar.stop_sequences)
+    {
+        let emit = ar
+            .detok
+            .text
+            .get(emitted_before..pos)
             .unwrap_or_default()
             .to_owned();
         (emit, true)
@@ -884,6 +883,13 @@ fn materialize_decode_step(
         .is_some_and(crate::constrained::ConstrainedGenerator::is_finished);
 
     let finished = is_eos || at_max || hit_stop || constraint_done;
+    if finished && !hit_stop {
+        final_new_text.push_str(
+            &ar.detok
+                .flush(tokenizer, &ar.generated_tokens)
+                .unwrap_or_default(),
+        );
+    }
     let finish_reason = if is_eos || hit_stop || constraint_done {
         Some("stop".to_owned())
     } else if at_max {
@@ -912,20 +918,6 @@ fn check_stop_sequences_simple(text: &str, stop_sequences: &[String]) -> bool {
     stop_sequences.iter().any(|seq| text.contains(seq.as_str()))
 }
 
-/// Truncate text at the earliest stop sequence.
-fn truncate_at_stop(text: &str, stop_sequences: &[String]) -> String {
-    let mut earliest: Option<usize> = None;
-    for seq in stop_sequences {
-        if let Some(pos) = text.find(seq.as_str()) {
-            earliest = Some(earliest.map_or(pos, |prev| prev.min(pos)));
-        }
-    }
-    earliest.map_or_else(
-        || text.to_owned(),
-        |pos| text.get(..pos).unwrap_or_default().to_owned(),
-    )
-}
-
 #[cfg(test)]
 #[allow(clippy::panic, clippy::unwrap_used, clippy::indexing_slicing)]
 mod tests {
@@ -964,39 +956,6 @@ mod tests {
         assert!(check_stop_sequences_simple("text\n\nmore", &stops));
     }
 
-    // -----------------------------------------------------------------------
-    // truncate_at_stop
-    // -----------------------------------------------------------------------
-
-    #[test]
-    fn truncate_no_stop_returns_full_text() {
-        let stops = vec!["</s>".to_owned()];
-        assert_eq!(truncate_at_stop("hello world", &stops), "hello world");
-    }
-
-    #[test]
-    fn truncate_at_stop_removes_suffix() {
-        let stops = vec!["</s>".to_owned()];
-        assert_eq!(truncate_at_stop("hello</s>", &stops), "hello");
-    }
-
-    #[test]
-    fn truncate_at_earliest_of_multiple_stops() {
-        let stops = vec!["BBB".to_owned(), "AAA".to_owned()];
-        assert_eq!(truncate_at_stop("xAAAyBBBz", &stops), "x");
-    }
-
-    #[test]
-    fn truncate_empty_stops() {
-        assert_eq!(truncate_at_stop("hello", &[]), "hello");
-    }
-
-    #[test]
-    fn truncate_stop_at_start() {
-        let stops = vec!["STOP".to_owned()];
-        assert_eq!(truncate_at_stop("STOPrest", &stops), "");
-    }
-
     // -----------------------------------------------------------------------
     // materialize_decode_step
     // -----------------------------------------------------------------------
@@ -1017,7 +976,7 @@ mod tests {
             constraint: None,
             response_tx: tx,
             prompt_len: 5,
-            prev_decoded_len: 0,
+            detok: IncrementalDetok::new(String::new(), 0),
         };
         (ar, rx)
     }
diff --git a/crates/higgs-engine/src/simple.rs b/crates/higgs-engine/src/simple.rs
index 9b37e73d..c22b4c60 100644
--- a/crates/higgs-engine/src/simple.rs
+++ b/crates/higgs-engine/src/simple.rs
@@ -1833,7 +1833,7 @@ impl SimpleEngine {
         tokens: &mut Vec<u32>,
         stop_sequences: &[String],
         sender: &tokio::sync::mpsc::Sender<StreamingOutput>,
-        mut prev_decoded_len: usize,
+        mut detok: IncrementalDetok,
         enable_thinking: bool,
     ) -> Result<(), EngineError> {
         let has_stop_sequences = !stop_sequences.is_empty();
@@ -1956,36 +1956,25 @@ impl SimpleEngine {
                                 let completion_len = Self::completion_len(tokens)?;
                                 let is_max = completion_len >= max_tokens;
 
-                                let full_text = self.decode_tokens(tokens)?;
-                                let (final_new_text, hit_stop_seq) = if has_stop_sequences {
-                                    check_stop_sequences(&full_text, stop_sequences).map_or_else(
-                                        || {
-                                            (
-                                                full_text
-                                                    .get(prev_decoded_len..)
-                                                    .unwrap_or_default()
-                                                    .to_owned(),
-                                                false,
-                                            )
-                                        },
-                                        |truncated| {
-                                            let emit = truncated
-                                                .get(prev_decoded_len..)
+                                let new_text = detok.append(&self.tokenizer, tokens)?;
+                                let emitted_before = detok.text.len() - new_text.len();
+                                let (mut final_new_text, hit_stop_seq) = if has_stop_sequences {
+                                    find_stop_in_tail(&detok.text, new_text.len(), stop_sequences)
+                                        .map_or((new_text, false), |pos| {
+                                            let emit = detok
+                                                .text
+                                                .get(emitted_before..pos)
                                                 .unwrap_or_default()
                                                 .to_owned();
                                             (emit, true)
-                                        },
-                                    )
+                                        })
                                 } else {
-                                    (
-                                        full_text
-                                            .get(prev_decoded_len..)
-                                            .unwrap_or_default()
-                                            .to_owned(),
-                                        false,
-                                    )
+                                    (new_text, false)
                                 };
                                 let step_finished = is_eos || is_max || hit_stop_seq;
+                                if step_finished && !hit_stop_seq {
+                                    final_new_text.push_str(&detok.flush(&self.tokenizer, tokens)?);
+                                }
                                 let finish_reason = if is_eos || hit_stop_seq {
                                     Some("stop".to_owned())
                                 } else if is_max {
@@ -1993,7 +1982,6 @@ impl SimpleEngine {
                                 } else {
                                     None
                                 };
-                                prev_decoded_len = full_text.len();
 
                                 if step_finished {
                                     let elapsed = t_start.elapsed();
@@ -2034,20 +2022,16 @@ impl SimpleEngine {
                 let completion_len = Self::completion_len(tokens)?;
                 let is_max = completion_len >= max_tokens;
 
-                let full_text = self.decode_tokens(tokens)?;
-                let new_text = full_text
-                    .get(prev_decoded_len..)
-                    .unwrap_or_default()
-                    .to_owned();
-                let old_decoded_len = prev_decoded_len;
-                prev_decoded_len = full_text.len();
+                let new_text = detok.append(&self.tokenizer, tokens)?;
+                let emitted_before = detok.text.len() - new_text.len();
 
-                let (final_new_text, hit_stop_seq) = if has_stop_sequences {
-                    check_stop_sequences(&full_text, stop_sequences).map_or(
+                let (mut final_new_text, hit_stop_seq) = if has_stop_sequences {
+                    find_stop_in_tail(&detok.text, new_text.len(), stop_sequences).map_or(
                         (new_text, false),
-                        |truncated| {
-                            let emit = truncated
-                                .get(old_decoded_len..)
+                        |pos| {
+                            let emit = detok
+                                .text
+                                .get(emitted_before..pos)
                                 .unwrap_or_default()
                                 .to_owned();
                             (emit, true)
@@ -2058,6 +2042,9 @@ impl SimpleEngine {
                 };
 
                 let step_finished = is_eos || is_max || hit_stop_seq;
+                if step_finished && !hit_stop_seq {
+                    final_new_text.push_str(&detok.flush(&self.tokenizer, tokens)?);
+                }
                 let finish_reason = if is_eos || hit_stop_seq {
                     Some("stop".to_owned())
                 } else if is_max {
@@ -2236,7 +2223,7 @@ impl SimpleEngine {
                 |truncated| (truncated, true),
             )
         };
-        let mut prev_decoded_len = first_decoded.len();
+        let mut detok = IncrementalDetok::new(first_decoded, all_tokens.len());
 
         let first_is_eos = self.eos_token_ids.contains(&first_token_id);
         let finished = first_is_eos || first_hit_stop || 1 >= max_tokens;
@@ -2289,7 +2276,7 @@ impl SimpleEngine {
                 &mut all_tokens,
                 stop_sequences,
                 sender,
-                prev_decoded_len,
+                detok,
                 enable_thinking,
             );
         }
@@ -2387,22 +2374,18 @@ impl SimpleEngine {
 
             let completion_len = Self::completion_len(&all_tokens)?;
 
-            let full_text = self.decode_tokens(&all_tokens)?;
-            let new_text = full_text
-                .get(prev_decoded_len..)
-                .unwrap_or_default()
-                .to_owned();
-            let old_decoded_len = prev_decoded_len;
-            prev_decoded_len = full_text.len();
+            let new_text = detok.append(&self.tokenizer, &all_tokens)?;
+            let emitted_before = detok.text.len() - new_text.len();
 
-            let (final_new_text, hit_stop_seq) = if stop_sequences.is_empty() {
+            let (mut final_new_text, hit_stop_seq) = if stop_sequences.is_empty() {
                 (new_text, false)
             } else {
-                check_stop_sequences(&full_text, stop_sequences).map_or(
+                find_stop_in_tail(&detok.text, new_text.len(), stop_sequences).map_or(
                     (new_text, false),
-                    |truncated| {
-                        let emit = truncated
-                            .get(old_decoded_len..)
+                    |pos| {
+                        let emit = detok
+                            .text
+                            .get(emitted_before..pos)
                             .unwrap_or_default()
                             .to_owned();
                         (emit, true)
@@ -2417,6 +2400,10 @@ impl SimpleEngine {
                 .is_some_and(crate::constrained::ConstrainedGenerator::is_finished);
             let step_finished = is_eos || is_max || hit_stop_seq || constraint_done;
 
+            if step_finished && !hit_stop_seq {
+                final_new_text.push_str(&detok.flush(&self.tokenizer, &all_tokens)?);
+            }
+
             let finish_reason = if is_eos || hit_stop_seq || constraint_done {
                 Some("stop".to_owned())
             } else if is_max {
@@ -2460,6 +2447,134 @@ impl SimpleEngine {
     }
 }
 
+/// Cap on the trailing token window re-decoded per streaming step.
+/// Generous for any multi-token UTF-8 sequence; prevents a stream of
+/// undecodable tokens from growing the window without bound.
+const MAX_DETOK_WINDOW: usize = 64;
+
+/// Incremental streaming detokenizer.
+///
+/// Re-decoding the full completion on every generated token is O(n^2) in
+/// completion length. Instead, decode only `tokens[prefix_offset..]` and emit
+/// the difference against `tokens[prefix_offset..read_offset]`; both windows
+/// start at the same token, so tokenizer boundary effects cancel out. Text
+/// ending in an incomplete UTF-8 sequence (trailing replacement char) is held
+/// back until a later token completes it.
+pub(crate) struct IncrementalDetok {
+    /// Start of the decode window; advanced whenever text is emitted.
+    prefix_offset: usize,
+    /// Number of leading tokens already represented in `text`.
+    read_offset: usize,
+    /// All text decoded so far (streamed to the client incrementally).
+    pub(crate) text: String,
+}
+
+impl IncrementalDetok {
+    /// Start from text already decoded for the first `token_count` tokens.
+    pub(crate) const fn new(text: String, token_count: usize) -> Self {
+        Self {
+            prefix_offset: 0,
+            read_offset: token_count,
+            text,
+        }
+    }
+
+    fn decode(tokenizer: &Tokenizer, tokens: &[u32]) -> Result<String, EngineError> {
+        tokenizer
+            .decode(tokens, true)
+            .map_err(|e| EngineError::Tokenization(e.to_string()))
+    }
+
+    /// Decode the trailing window of `tokens`, appending newly stable text to
+    /// `self.text` and returning it.
+    pub(crate) fn append(
+        &mut self,
+        tokenizer: &Tokenizer,
+        tokens: &[u32],
+    ) -> Result<String, EngineError> {
+        let prefix_tokens = tokens
+            .get(self.prefix_offset..self.read_offset)
+            .unwrap_or_default();
+        let window_tokens = tokens.get(self.prefix_offset..).unwrap_or_default();
+        let prefix_text = Self::decode(tokenizer, prefix_tokens)?;
+        let window_text = Self::decode(tokenizer, window_tokens)?;
+
+        let over_window = window_tokens.len() > MAX_DETOK_WINDOW;
+        if window_text.len() > prefix_text.len()
+            && (!window_text.ends_with('\u{FFFD}') || over_window)
+        {
+            let new_text = window_text
+                .get(prefix_text.len()..)
+                .unwrap_or_default()
+                .to_owned();
+            self.prefix_offset = self.read_offset;
+            self.read_offset = tokens.len();
+            self.text.push_str(&new_text);
+            return Ok(new_text);
+        }
+        if over_window && window_text.len() == prefix_text.len() {
+            // The pending tokens decode to nothing (e.g. skipped special
+            // tokens); drop them so the window stays bounded.
+            self.prefix_offset = tokens.len();
+            self.read_offset = tokens.len();
+        }
+        Ok(String::new())
+    }
+
+    /// Emit any text still held back by `append` (a trailing incomplete UTF-8
+    /// sequence). Called when generation finishes so the total streamed text
+    /// matches a full decode of the token buffer.
+    pub(crate) fn flush(
+        &mut self,
+        tokenizer: &Tokenizer,
+        tokens: &[u32],
+    ) -> Result<String, EngineError> {
+        if self.read_offset >= tokens.len() {
+            return Ok(String::new());
+        }
+        let prefix_tokens = tokens
+            .get(self.prefix_offset..self.read_offset)
+            .unwrap_or_default();
+        let window_tokens = tokens.get(self.prefix_offset..).unwrap_or_default();
+        let prefix_text = Self::decode(tokenizer, prefix_tokens)?;
+        let window_text = Self::decode(tokenizer, window_tokens)?;
+        let new_text = window_text
+            .get(prefix_text.len()..)
+            .unwrap_or_default()
+            .to_owned();
+        self.prefix_offset = self.read_offset;
+        self.read_offset = tokens.len();
+        self.text.push_str(&new_text);
+        Ok(new_text)
+    }
+}
+
+/// Find the earliest stop-sequence occurrence that could involve the newly
+/// appended text, scanning only the tail of `text` rather than the whole
+/// buffer. Returns the absolute byte position where the match starts.
+pub(crate) fn find_stop_in_tail(
+    text: &str,
+    new_len: usize,
+    stop_sequences: &[String],
+) -> Option<usize> {
+    let max_stop = stop_sequences.iter().map(String::len).max().unwrap_or(0);
+    if max_stop == 0 {
+        return None;
+    }
+    let mut start = text.len().saturating_sub(new_len + max_stop - 1);
+    while !text.is_char_boundary(start) {
+        start -= 1;
+    }
+    let tail = text.get(start..)?;
+    let mut earliest: Option<usize> = None;
+    for seq in stop_sequences {
+        if let Some(pos) = tail.find(seq.as_str()) {
+            earliest = Some(earliest.map_or(pos, |prev| prev.min(pos)));
+        }
+    }
+    earliest.map(|pos| start + pos)
+}
+
 /// Check if any stop sequence appears in the generated text.
 /// Returns `Some(truncated_text)` if a stop sequence was found, None otherwise.
 fn check_stop_sequences(text: &str, stop_sequences: &[String]) -> Option<String> {
@@ -2587,8 +2702,8 @@ fn detect_thinking_support(model_dir: &Path) -> bool {
 #[allow(clippy::panic, clippy::unwrap_used)]
 mod tests {
     use super::{
-        adaptive_draft_depth_for_cap, check_stop_sequences, derive_model_name,
-        estimate_paged_kv_blocks, parse_enabled_flag,
+        IncrementalDetok, Tokenizer, adaptive_draft_depth_for_cap, check_stop_sequences,
+        derive_model_name, estimate_paged_kv_blocks, find_stop_in_tail, parse_enabled_flag,
     };
     use std::path::Path;
 
@@ -2857,4 +2972,107 @@ mod tests {
         assert_eq!(parse_enabled_flag(Some("no")), Some(false));
         assert_eq!(parse_enabled_flag(Some("unexpected")), None);
     }
+
+    // -----------------------------------------------------------------------
+    // find_stop_in_tail
+    // -----------------------------------------------------------------------
+
+    #[test]
+    fn find_stop_in_tail_empty_stops_returns_none() {
+        assert_eq!(find_stop_in_tail("hello world", 5, &[]), None);
+    }
+
+    #[test]
+    fn find_stop_in_tail_finds_stop_in_new_text() {
+        let stops = vec!["</s>".to_owned()];
+        // "hello</s>" with the last 4 bytes new
+        assert_eq!(find_stop_in_tail("hello</s>", 4, &stops), Some(5));
+    }
+
+    #[test]
+    fn find_stop_in_tail_finds_stop_spanning_boundary() {
+        let stops = vec!["STOP".to_owned()];
+        // "abcSTOP" where only "OP" is new; "ST" was emitted previously
+        assert_eq!(find_stop_in_tail("abcSTOP", 2, &stops), Some(3));
+    }
+
+    #[test]
+    fn find_stop_in_tail_ignores_stop_fully_in_old_text() {
+        let stops = vec!["XY".to_owned()];
+        // "XYabcdefgh" with 2 new bytes: the scan window covers the last
+        // 2 + (2 - 1) = 3 bytes only, so the old "XY" is not rescanned.
+        assert_eq!(find_stop_in_tail("XYabcdefgh", 2, &stops), None);
+    }
+
+    #[test]
+    fn find_stop_in_tail_earliest_of_multiple() {
+        let stops = vec!["BBB".to_owned(), "AAA".to_owned()];
+        assert_eq!(find_stop_in_tail("xAAAyBBBz", 9, &stops), Some(1));
+    }
+
+    #[test]
+    fn find_stop_in_tail_handles_multibyte_boundary() {
+        let stops = vec!["端".to_owned()];
+        // The tail start can land mid-codepoint; it must back up to a
+        // char boundary instead of panicking. "日本語端" = 4 chars, 12 bytes.
+        assert_eq!(find_stop_in_tail("日本語端", 3, &stops), Some(9));
+    }
+
+    // -----------------------------------------------------------------------
+    // IncrementalDetok
+    // -----------------------------------------------------------------------
+
+    /// Minimal word-level tokenizer with a byte-level decoder for detok tests.
+    fn word_tokenizer() -> Tokenizer {
+        let json = r#"{
+            "version": "1.0",
+            "truncation": null,
+            "padding": null,
+            "added_tokens": [],
+            "normalizer": null,
+            "pre_tokenizer": null,
+            "post_processor": null,
+            "decoder": {
+                "type": "ByteLevel",
+                "add_prefix_space": true,
+                "trim_offsets": true,
+                "use_regex": true
+            },
+            "model": {
+                "type": "WordLevel",
+                "vocab": {"Hello": 0, "Ġworld": 1, "!": 2},
+                "unk_token": "Hello"
+            }
+        }"#;
+        Tokenizer::from_bytes(json.as_bytes()).unwrap()
+    }
+
+    #[test]
+    fn incremental_detok_emits_per_token_diffs() {
+        let tokenizer = word_tokenizer();
+        let mut tokens: Vec<u32> = vec![0];
+        let first = tokenizer.decode(&tokens, true).unwrap();
+        let mut detok = IncrementalDetok::new(first.clone(), tokens.len());
+
+        tokens.push(1);
+        let second = detok.append(&tokenizer, &tokens).unwrap();
+        tokens.push(2);
+        let third = detok.append(&tokenizer, &tokens).unwrap();
+
+        let full = tokenizer.decode(&tokens, true).unwrap();
+        assert_eq!(format!("{first}{second}{third}"), full);
+        assert_eq!(detok.text, full);
+    }
+
+    #[test]
+    fn incremental_detok_flush_without_pending_is_empty() {
+        let tokenizer = word_tokenizer();
+        let mut tokens: Vec<u32> = vec![0];
+        let first = tokenizer.decode(&tokens, true).unwrap();
+        let mut detok = IncrementalDetok::new(first, tokens.len());
+
+        tokens.push(1);
+        detok.append(&tokenizer, &tokens).unwrap();
+        assert_eq!(detok.flush(&tokenizer, &tokens).unwrap(), "");
+    }
 }

From 0d8e2594809589fe405a3b8091e669164b2e8e52 Mon Sep 17 00:00:00 2001
From: Claude <noreply@anthropic.com>
Date: Thu, 11 Jun 2026 01:39:47 +0000
Subject: [PATCH 4/9] fix: address CI lint and test failures

- doctor.rs: replace (mode & 0o077) == 0 with mode.trailing_zeros() >= 6
  to fix clippy::verbose_bit_mask (denied by -D warnings)
- proxy.rs: rename error_body to error_body_raw before truncation to fix
  clippy::shadow_reuse (denied lint)
- lib.rs: rename build_cors_layer param origins -> origins_opt to fix
  clippy::shadow_reuse on the `let origins = origins_opt?` unwrap;
  rewrite filter_map match to Result::map_or_else to fix
  clippy::single_match_else and clippy::option_if_let_else (both denied)
- tests/integration/router.rs:184: update build_router_is_public_and_callable
  type assertion to match new 6-parameter signature (cors_origins was added)

https://claude.ai/code/session_01Rko86UGb3GKp84joEs5E5c
---
 crates/higgs/src/doctor.rs               |  2 +-
 crates/higgs/src/lib.rs                  | 18 ++++++++++--------
 crates/higgs/src/proxy.rs                |  4 ++--
 crates/higgs/tests/integration/router.rs |  3 ++-
 4 files changed, 15 insertions(+), 12 deletions(-)

diff --git a/crates/higgs/src/doctor.rs b/crates/higgs/src/doctor.rs
index b570681e..7095ba73 100644
--- a/crates/higgs/src/doctor.rs
+++ b/crates/higgs/src/doctor.rs
@@ -224,7 +224,7 @@ fn check_config_file_permissions(
     let mode = metadata.permissions().mode() & 0o777;
     let has_secrets =
         config.server.api_key.is_some() || config.providers.values().any(|p| p.api_key.is_some());
-    if (mode & 0o077) == 0 {
+    if mode.trailing_zeros() >= 6 {
         pass(
             &format!("config file permissions are owner-only ({mode:03o})"),
             result,
diff --git a/crates/higgs/src/lib.rs b/crates/higgs/src/lib.rs
index f49da3ad..a6d7a173 100644
--- a/crates/higgs/src/lib.rs
+++ b/crates/higgs/src/lib.rs
@@ -116,19 +116,21 @@ pub fn build_router(
 ///
 /// `None` (unset) sends no CORS headers; `["*"]` is fully permissive;
 /// anything else is an explicit origin allow-list.
-fn build_cors_layer(origins: Option<&[String]>) -> Option<CorsLayer> {
-    let origins = origins?;
+fn build_cors_layer(origins_opt: Option<&[String]>) -> Option<CorsLayer> {
+    let origins = origins_opt?;
     if origins.iter().any(|o| o == "*") {
         return Some(CorsLayer::permissive());
     }
     let parsed: Vec<HeaderValue> = origins
         .iter()
-        .filter_map(|origin| match origin.parse::<HeaderValue>() {
-            Ok(value) => Some(value),
-            Err(_) => {
-                tracing::warn!(origin = %origin, "ignoring invalid CORS origin");
-                None
-            }
+        .filter_map(|origin| {
+            origin.parse::<HeaderValue>().map_or_else(
+                |_| {
+                    tracing::warn!(origin = %origin, "ignoring invalid CORS origin");
+                    None
+                },
+                Some,
+            )
         })
         .collect();
     if parsed.is_empty() {
diff --git a/crates/higgs/src/proxy.rs b/crates/higgs/src/proxy.rs
index 406f2d0a..3cfef9cd 100644
--- a/crates/higgs/src/proxy.rs
+++ b/crates/higgs/src/proxy.rs
@@ -195,11 +195,11 @@ pub async fn send_to_provider(
 
     let status = upstream.status().as_u16();
     if status >= 400 {
-        let error_body = upstream
+        let error_body_raw = upstream
             .text()
             .await
             .unwrap_or_else(|_| String::from("(failed to read error body)"));
-        let error_body = truncate_error_body(&error_body, MAX_UPSTREAM_ERROR_BYTES);
+        let error_body = truncate_error_body(&error_body_raw, MAX_UPSTREAM_ERROR_BYTES);
         return Err(ServerError::ProxyError(format!(
             "upstream returned HTTP {status}: {error_body}"
         )));
diff --git a/crates/higgs/tests/integration/router.rs b/crates/higgs/tests/integration/router.rs
index 988eb014..fdf8f54b 100644
--- a/crates/higgs/tests/integration/router.rs
+++ b/crates/higgs/tests/integration/router.rs
@@ -181,7 +181,8 @@ fn build_router_is_public_and_callable() {
     // This test verifies that build_router was successfully extracted to lib.rs.
     // We cannot actually call it without a real AppState, but we can verify
     // the function signature exists and is accessible.
-    let _: fn(Arc<AppState>, f64, Option<String>, u32, usize) -> axum::Router = build_router;
+    let _: fn(Arc<AppState>, f64, Option<String>, u32, usize, Option<Vec<String>>) -> axum::Router =
+        build_router;
 }
 
 // ---------------------------------------------------------------------------

From c1953125d8399f899236d7917f26b94051fccb4b Mon Sep 17 00:00:00 2001
From: Claude <noreply@anthropic.com>
Date: Thu, 11 Jun 2026 01:43:38 +0000
Subject: [PATCH 5/9] fix: split long first doc paragraphs for clippy

https://claude.ai/code/session_01Rko86UGb3GKp84joEs5E5c
---
 crates/higgs-engine/src/simple.rs | 7 ++++---
 crates/higgs/src/config.rs        | 9 +++++----
 2 files changed, 9 insertions(+), 7 deletions(-)

diff --git a/crates/higgs-engine/src/simple.rs b/crates/higgs-engine/src/simple.rs
index c22b4c60..1d9a0077 100644
--- a/crates/higgs-engine/src/simple.rs
+++ b/crates/higgs-engine/src/simple.rs
@@ -2549,9 +2549,10 @@ impl IncrementalDetok {
     }
 }
 
-/// Find the earliest stop-sequence occurrence that could involve the newly
-/// appended text, scanning only the tail of `text` rather than the whole
-/// buffer. Returns the absolute byte position where the match starts.
+/// Find the earliest stop-sequence occurrence that could involve the newly appended text.
+///
+/// Scans only the tail of `text` rather than the whole buffer.
+/// Returns the absolute byte position where the match starts.
 pub(crate) fn find_stop_in_tail(
     text: &str,
     new_len: usize,
diff --git a/crates/higgs/src/config.rs b/crates/higgs/src/config.rs
index 7d1fc7f3..43373c47 100644
--- a/crates/higgs/src/config.rs
+++ b/crates/higgs/src/config.rs
@@ -961,10 +961,11 @@ pub fn default_config_path() -> PathBuf {
     config_dir().join("config.toml")
 }
 
-/// Write a file with owner-only permissions (0o600 on Unix). Used for config
-/// files (which may contain provider API keys) and other daemon-private
-/// files. The mode is applied at creation; existing files keep their
-/// permissions (doctor warns about loose ones).
+/// Write a file with owner-only permissions (0o600 on Unix).
+///
+/// Used for config files (which may contain provider API keys) and other
+/// daemon-private files. The mode is applied at creation; existing files
+/// keep their permissions (doctor warns about loose ones).
 pub fn write_private_file(path: &Path, contents: &str) -> std::io::Result<()> {
     use std::io::Write as _;
 

From d0845926a64c0179646855b628ef0da737e31b2f Mon Sep 17 00:00:00 2001
From: Claude <noreply@anthropic.com>
Date: Thu, 11 Jun 2026 01:48:58 +0000
Subject: [PATCH 6/9] fix: type alias for build_router signature assertion
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Introduce `BuildRouterFn` type alias in the integration test to satisfy
`clippy::type_complexity` (fatal under `-D warnings` with nursery group).
No engine-crate changes were needed: audit of all new code added by this
PR in `crates/higgs-engine/src/simple.rs`, `batch_engine.rs`, and
`chat_template.rs` found no additional fatal lint violations — `use_self`,
`option_if_let_else`, `redundant_clone`, `uninlined_format_args`,
`items_after_statements`, `doc_markdown`, `missing_const_for_fn`, and
`trivially_copy_pass_by_ref` were each reviewed and ruled out.

https://claude.ai/code/session_01Rko86UGb3GKp84joEs5E5c
---
 crates/higgs/tests/integration/router.rs | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/crates/higgs/tests/integration/router.rs b/crates/higgs/tests/integration/router.rs
index fdf8f54b..9e7b9e3e 100644
--- a/crates/higgs/tests/integration/router.rs
+++ b/crates/higgs/tests/integration/router.rs
@@ -176,13 +176,15 @@ async fn streaming_chat_returns_sse_content_type() {
 // build_router smoke test (verifies the function is callable and public)
 // ---------------------------------------------------------------------------
 
+type BuildRouterFn =
+    fn(Arc<AppState>, f64, Option<String>, u32, usize, Option<Vec<String>>) -> axum::Router;
+
 #[test]
 fn build_router_is_public_and_callable() {
     // This test verifies that build_router was successfully extracted to lib.rs.
     // We cannot actually call it without a real AppState, but we can verify
     // the function signature exists and is accessible.
-    let _: fn(Arc<AppState>, f64, Option<String>, u32, usize, Option<Vec<String>>) -> axum::Router =
-        build_router;
+    let _: BuildRouterFn = build_router;
 }
 
 // ---------------------------------------------------------------------------

From 722218df3cfb10c8671721cf45888ec45e31187c Mon Sep 17 00:00:00 2001
From: Claude <noreply@anthropic.com>
Date: Thu, 11 Jun 2026 02:40:04 +0000
Subject: [PATCH 7/9] fix: route first streamed token through incremental
 detokenizer
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Previously the first generated token was decoded with a raw tokenizer
call, bypassing IncrementalDetok. This caused two bugs:
1. Partial UTF-8 sequences in the first token were never held back
   (they would appear as replacement characters instead of waiting
   for the next token to complete the codepoint).
2. When a stop sequence appeared mid-first-token, the prefix text
   before the stop was dropped entirely (batch_engine sent "" instead
   of the valid prefix).

Site A (simple.rs generate_streaming_inner): initialize detok with
empty state, run the first token through detok.append(), mirror the
main loop's find_stop_in_tail stop check, and flush on EOS/max finish.

Site B (batch_engine.rs prefill_request): same empty-init + append
approach; fix the dropped-prefix bug by extracting the pre-stop slice
from detok.text instead of sending ""; flush on non-stop finish.
check_stop_sequences_simple (and its tests) removed — now unused.

Also fix detach() PID file write in daemon.rs: replace fs::write
with config::write_private_file so the PID file gets mode 0600
(private), consistent with write_pid_file() higher in the same file.

New regression tests added to simple.rs:
- incremental_detok_first_token_partial_utf8_held_back
- incremental_detok_flush_emits_pending
- find_stop_in_tail_first_token_prefix_and_stop

https://claude.ai/code/session_01Rko86UGb3GKp84joEs5E5c
---
 crates/higgs-engine/src/batch_engine.rs | 71 ++++++++-------------
 crates/higgs-engine/src/simple.rs       | 82 ++++++++++++++++++++++---
 crates/higgs/src/daemon.rs              |  2 +-
 3 files changed, 101 insertions(+), 54 deletions(-)

diff --git a/crates/higgs-engine/src/batch_engine.rs b/crates/higgs-engine/src/batch_engine.rs
index 0fb68024..cef09d41 100644
--- a/crates/higgs-engine/src/batch_engine.rs
+++ b/crates/higgs-engine/src/batch_engine.rs
@@ -723,20 +723,40 @@ fn prefill_request(
             .as_ref()
             .map(|lp| lp.materialize(first_token_id));
 
-        // Decode the first token for text diff tracking
-        let first_text = tokenizer
-            .decode(&[first_token_id], true)
+        // Decode the first token incrementally (routes through IncrementalDetok
+        // so partial UTF-8 is held back and prefix-before-stop is correctly emitted).
+        let mut detok = IncrementalDetok::new(String::new(), 0);
+        let first_chunk = detok
+            .append(tokenizer, &[first_token_id])
             .unwrap_or_default();
+        let emitted_before = detok.text.len() - first_chunk.len();
 
         // Check if we're done after the first token
         let is_eos = eos_token_ids.contains(&first_token_id);
-        let hit_stop = check_stop_sequences_simple(&first_text, &req.stop_sequences);
+        let hit_stop = !req.stop_sequences.is_empty()
+            && find_stop_in_tail(&detok.text, first_chunk.len(), &req.stop_sequences).is_some();
         let at_max = req.max_tokens <= 1;
 
         if is_eos || hit_stop || at_max {
             let finish_reason = if is_eos || hit_stop { "stop" } else { "length" };
+            let mut send_text = if hit_stop {
+                // Emit any prefix text before the stop sequence
+                find_stop_in_tail(&detok.text, first_chunk.len(), &req.stop_sequences)
+                    .and_then(|pos| detok.text.get(emitted_before..pos))
+                    .unwrap_or_default()
+                    .to_owned()
+            } else {
+                first_chunk
+            };
+            if !hit_stop {
+                send_text.push_str(
+                    &detok
+                        .flush(tokenizer, &[first_token_id])
+                        .unwrap_or_default(),
+                );
+            }
             let _ = req.response_tx.blocking_send(StreamingOutput {
-                new_text: if hit_stop { String::new() } else { first_text },
+                new_text: send_text,
                 finished: true,
                 finish_reason: Some(finish_reason.to_owned()),
                 prompt_tokens: prompt_len,
@@ -747,11 +767,10 @@ fn prefill_request(
         }
 
         // Send first token
-        let detok = IncrementalDetok::new(first_text.clone(), 1);
         if req
             .response_tx
             .blocking_send(StreamingOutput {
-                new_text: first_text,
+                new_text: first_chunk,
                 finished: false,
                 finish_reason: None,
                 prompt_tokens: prompt_len,
@@ -913,49 +932,11 @@ fn materialize_decode_step(
     finished || disconnected
 }
 
-/// Check if any stop sequence appears in the text.
-fn check_stop_sequences_simple(text: &str, stop_sequences: &[String]) -> bool {
-    stop_sequences.iter().any(|seq| text.contains(seq.as_str()))
-}
-
 #[cfg(test)]
 #[allow(clippy::panic, clippy::unwrap_used, clippy::indexing_slicing)]
 mod tests {
     use super::*;
 
-    // -----------------------------------------------------------------------
-    // check_stop_sequences_simple
-    // -----------------------------------------------------------------------
-
-    #[test]
-    fn stop_sequences_empty_never_matches() {
-        assert!(!check_stop_sequences_simple("hello world", &[]));
-    }
-
-    #[test]
-    fn stop_sequences_match_at_end() {
-        let stops = vec!["</s>".to_owned()];
-        assert!(check_stop_sequences_simple("some text</s>", &stops));
-    }
-
-    #[test]
-    fn stop_sequences_match_in_middle() {
-        let stops = vec!["STOP".to_owned()];
-        assert!(check_stop_sequences_simple("before STOP after", &stops));
-    }
-
-    #[test]
-    fn stop_sequences_no_match() {
-        let stops = vec!["</s>".to_owned(), "<|end|>".to_owned()];
-        assert!(!check_stop_sequences_simple("normal text", &stops));
-    }
-
-    #[test]
-    fn stop_sequences_multiple_one_matches() {
-        let stops = vec!["</s>".to_owned(), "\n\n".to_owned()];
-        assert!(check_stop_sequences_simple("text\n\nmore", &stops));
-    }
-
     // -----------------------------------------------------------------------
     // materialize_decode_step
     // -----------------------------------------------------------------------
diff --git a/crates/higgs-engine/src/simple.rs b/crates/higgs-engine/src/simple.rs
index 1d9a0077..3533790f 100644
--- a/crates/higgs-engine/src/simple.rs
+++ b/crates/higgs-engine/src/simple.rs
@@ -2214,20 +2214,32 @@ impl SimpleEngine {
         }
         all_tokens.push(first_token_id);
 
-        let first_decoded = self.decode_tokens(&all_tokens)?;
-        let (first_text, first_hit_stop) = if stop_sequences.is_empty() {
-            (first_decoded.clone(), false)
+        let mut detok = IncrementalDetok::new(String::new(), 0);
+        let new_text = detok.append(&self.tokenizer, &all_tokens)?;
+        let emitted_before = detok.text.len() - new_text.len();
+        let (mut first_text, first_hit_stop) = if stop_sequences.is_empty() {
+            (new_text, false)
         } else {
-            check_stop_sequences(&first_decoded, stop_sequences).map_or_else(
-                || (first_decoded.clone(), false),
-                |truncated| (truncated, true),
+            find_stop_in_tail(&detok.text, new_text.len(), stop_sequences).map_or(
+                (new_text, false),
+                |pos| {
+                    let emit = detok
+                        .text
+                        .get(emitted_before..pos)
+                        .unwrap_or_default()
+                        .to_owned();
+                    (emit, true)
+                },
             )
         };
-        let mut detok = IncrementalDetok::new(first_decoded, all_tokens.len());
 
         let first_is_eos = self.eos_token_ids.contains(&first_token_id);
         let finished = first_is_eos || first_hit_stop || 1 >= max_tokens;
 
+        if finished && !first_hit_stop {
+            first_text.push_str(&detok.flush(&self.tokenizer, &all_tokens)?);
+        }
+
         let first_logprob = first_logprob_data
             .as_ref()
             .map(|lp| lp.materialize(first_token_id));
@@ -3041,7 +3053,7 @@ mod tests {
             },
             "model": {
                 "type": "WordLevel",
-                "vocab": {"Hello": 0, "Ġworld": 1, "!": 2},
+                "vocab": {"Hello": 0, "Ġworld": 1, "!": 2, "ðŁĺ": 3, "Ģ": 4},
                 "unk_token": "Hello"
             }
         }"#;
@@ -3076,4 +3088,58 @@ mod tests {
         detok.append(&tokenizer, &tokens).unwrap();
         assert_eq!(detok.flush(&tokenizer, &tokens).unwrap(), "");
     }
+
+    /// Tokens 3 and 4 are the byte-level pieces of 😀 (U+1F600).
+    /// Appending only token 3 must hold back the incomplete UTF-8 sequence
+    /// (return ""), and appending both tokens must emit the full emoji.
+    #[test]
+    fn incremental_detok_first_token_partial_utf8_held_back() {
+        let tokenizer = word_tokenizer();
+        let mut detok = IncrementalDetok::new(String::new(), 0);
+
+        // First partial piece: held back, no replacement char emitted
+        let held = detok.append(&tokenizer, &[3]).unwrap();
+        assert_eq!(held, "", "partial UTF-8 token must be held back");
+
+        // Completing the sequence emits the full emoji
+        let emitted = detok.append(&tokenizer, &[3, 4]).unwrap();
+        assert_eq!(emitted, "😀", "completing UTF-8 should emit the emoji");
+
+        let full = tokenizer.decode(&[3u32, 4], true).unwrap();
+        assert_eq!(detok.text, full, "detok.text must equal full decode");
+    }
+
+    /// flush() must emit any bytes held back by append(), and a second flush
+    /// on the same (now-fully-drained) detok must return "".
+    #[test]
+    fn incremental_detok_flush_emits_pending() {
+        let tokenizer = word_tokenizer();
+        let mut detok = IncrementalDetok::new(String::new(), 0);
+
+        // Partial piece held back
+        let held = detok.append(&tokenizer, &[3]).unwrap();
+        assert_eq!(held, "", "partial UTF-8 must be held back before flush");
+
+        // flush must emit something (a replacement char is acceptable here)
+        let flushed = detok.flush(&tokenizer, &[3]).unwrap();
+        assert!(!flushed.is_empty(), "flush must emit held-back text");
+
+        // A second flush on an already-drained detok returns ""
+        let flushed2 = detok.flush(&tokenizer, &[3]).unwrap();
+        assert_eq!(flushed2, "", "second flush must return empty string");
+    }
+
+    /// find_stop_in_tail must identify a stop whose prefix was already emitted,
+    /// returning the byte position that lets the caller emit the prefix ("hi")
+    /// before the stop ("STOP") in the same new-text window.
+    #[test]
+    fn find_stop_in_tail_first_token_prefix_and_stop() {
+        let stops = vec!["STOP".to_owned()];
+        // "hiSTOP": all 6 bytes are new, stop starts at byte 2
+        assert_eq!(
+            find_stop_in_tail("hiSTOP", 6, &stops),
+            Some(2),
+            "stop at pos 2 so 'hi' can be emitted as prefix"
+        );
+    }
 }
diff --git a/crates/higgs/src/daemon.rs b/crates/higgs/src/daemon.rs
index f980c50e..3322abeb 100644
--- a/crates/higgs/src/daemon.rs
+++ b/crates/higgs/src/daemon.rs
@@ -541,7 +541,7 @@ pub fn detach(config_path: &Path, verbose: bool, profile: Option<&str>) {
         let _ = child.wait();
     });
 
-    if let Err(e) = fs::write(pid_path(profile), child_pid.to_string()) {
+    if let Err(e) = config::write_private_file(&pid_path(profile), &child_pid.to_string()) {
         eprintln!("failed to write pid file: {e}");
         std::process::exit(1);
     }

From deefbe39345669e101b9f1fc5caa90de0aaaa973 Mon Sep 17 00:00:00 2001
From: Claude <noreply@anthropic.com>
Date: Thu, 11 Jun 2026 02:42:51 +0000
Subject: [PATCH 8/9] fix: doc_markdown lints in detok test comments

https://claude.ai/code/session_01Rko86UGb3GKp84joEs5E5c
---
 crates/higgs-engine/src/simple.rs | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/crates/higgs-engine/src/simple.rs b/crates/higgs-engine/src/simple.rs
index 3533790f..cd1b8025 100644
--- a/crates/higgs-engine/src/simple.rs
+++ b/crates/higgs-engine/src/simple.rs
@@ -3089,9 +3089,9 @@ mod tests {
         assert_eq!(detok.flush(&tokenizer, &tokens).unwrap(), "");
     }
 
-    /// Tokens 3 and 4 are the byte-level pieces of 😀 (U+1F600).
-    /// Appending only token 3 must hold back the incomplete UTF-8 sequence
-    /// (return ""), and appending both tokens must emit the full emoji.
+    // Tokens 3 and 4 are the byte-level pieces of 😀 (U+1F600).
+    // Appending only token 3 must hold back the incomplete UTF-8 sequence
+    // (return ""), and appending both tokens must emit the full emoji.
     #[test]
     fn incremental_detok_first_token_partial_utf8_held_back() {
         let tokenizer = word_tokenizer();
@@ -3109,8 +3109,8 @@ mod tests {
         assert_eq!(detok.text, full, "detok.text must equal full decode");
     }
 
-    /// flush() must emit any bytes held back by append(), and a second flush
-    /// on the same (now-fully-drained) detok must return "".
+    // flush() must emit any bytes held back by append(), and a second flush
+    // on the same (now-fully-drained) detok must return "".
     #[test]
     fn incremental_detok_flush_emits_pending() {
         let tokenizer = word_tokenizer();
@@ -3129,9 +3129,9 @@ mod tests {
         assert_eq!(flushed2, "", "second flush must return empty string");
     }
 
-    /// find_stop_in_tail must identify a stop whose prefix was already emitted,
-    /// returning the byte position that lets the caller emit the prefix ("hi")
-    /// before the stop ("STOP") in the same new-text window.
+    // find_stop_in_tail must identify a stop whose prefix was already emitted,
+    // returning the byte position that lets the caller emit the prefix ("hi")
+    // before the stop ("STOP") in the same new-text window.
     #[test]
     fn find_stop_in_tail_first_token_prefix_and_stop() {
         let stops = vec!["STOP".to_owned()];

From 8d16696c2d59750d3d742f203ab7316a4785b2be Mon Sep 17 00:00:00 2001
From: Claude <noreply@anthropic.com>
Date: Thu, 11 Jun 2026 02:45:49 +0000
Subject: [PATCH 9/9] fix: rename first-token detok bindings to avoid shadowing
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

In `generate_streaming_inner` (simple.rs), the first-token block
used `new_text` and `emitted_before` as binding names, which were
then reused in the main decode loop — triggering `-D clippy::shadow_unrelated`.

Rename the first-token block's bindings (lines ~2218-2234) to
`first_new_text` and `first_emitted_before` throughout that section
(`detok.append` result, `detok.text.len()` subtraction, both arms of
the `find_stop_in_tail` map_or, and the range in `.get()`).
The main loop's bindings at lines 2389-2390 are unchanged.

No logic change; no other shadowing candidates found in commits
722218d or deefbe3 (batch_engine.rs `prefill_request` binds
`first_chunk`/`emitted_before` once each; `materialize_decode_step`
uses `new_text`/`emitted_before` in a separate function scope).

https://claude.ai/code/session_01Rko86UGb3GKp84joEs5E5c
---
 crates/higgs-engine/src/simple.rs | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/crates/higgs-engine/src/simple.rs b/crates/higgs-engine/src/simple.rs
index cd1b8025..ac011529 100644
--- a/crates/higgs-engine/src/simple.rs
+++ b/crates/higgs-engine/src/simple.rs
@@ -2215,17 +2215,17 @@ impl SimpleEngine {
         all_tokens.push(first_token_id);
 
         let mut detok = IncrementalDetok::new(String::new(), 0);
-        let new_text = detok.append(&self.tokenizer, &all_tokens)?;
-        let emitted_before = detok.text.len() - new_text.len();
+        let first_new_text = detok.append(&self.tokenizer, &all_tokens)?;
+        let first_emitted_before = detok.text.len() - first_new_text.len();
         let (mut first_text, first_hit_stop) = if stop_sequences.is_empty() {
-            (new_text, false)
+            (first_new_text, false)
         } else {
-            find_stop_in_tail(&detok.text, new_text.len(), stop_sequences).map_or(
-                (new_text, false),
+            find_stop_in_tail(&detok.text, first_new_text.len(), stop_sequences).map_or(
+                (first_new_text, false),
                 |pos| {
                     let emit = detok
                         .text
-                        .get(emitted_before..pos)
+                        .get(first_emitted_before..pos)
                         .unwrap_or_default()
                         .to_owned();
                     (emit, true)