
Inference Strategy


Why No Self-Hosted Models

Last Mile 360 does not run any LLM inference infrastructure. No GPUs, no CUDA, no model weight downloads, no VRAM management. This is a deliberate decision, not a limitation.

The reasoning:

  1. Security products should minimize attack surface. A GPU server running inference is a server that needs patching, monitoring, and hardening. It's an origin server — the exact thing our architecture eliminates.

  2. Operational burden is not our value proposition. Users want security findings, not a side quest managing OOM kills on a 4090.

  3. API-based inference is better for our use case. Code analysis prompts are well-suited to API calls: structured input, structured output, no streaming requirements, cacheable.

  4. Cost is predictable. API pricing is per-token. Self-hosted inference has fixed costs whether or not anyone is scanning.

  5. The Claw family was evaluated and rejected entirely. See Source Repo Analysis for the full evaluation of OpenClaw, NanoClaw, PicoClaw, ZeroClaw, and MimiClaw.


Three-Tier Model Strategy

Tier 1: Claude API (Anthropic)

Role: Complex analysis requiring deep reasoning

Used for:

  • Contextual vulnerability assessment (is this innerHTML actually exploitable?)
  • Cross-file dependency analysis
  • Fix suggestion generation with framework awareness
  • Severity calibration (is this a real risk or a false positive?)

Why Claude: Best-in-class for code understanding, instruction following, and structured output. Low hallucination rate on security-relevant judgments.

Model: claude-sonnet-4-20250514 (balances quality and cost)
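
As a rough illustration, a Tier 1 call might look like the sketch below, assuming the Anthropic TypeScript SDK; the finding shape, prompt, and helper name are illustrative, not the scanner's actual schema.

```typescript
import Anthropic from "@anthropic-ai/sdk";

// Hypothetical finding shape; illustrative only, not the scanner's real schema.
interface Finding {
  rule: string;
  file: string;
  snippet: string;
}

// Sketch of a Tier 1 severity-calibration call: structured input, structured
// output, a single round trip, no streaming needed.
export async function calibrateSeverity(finding: Finding, apiKey: string): Promise<string> {
  const client = new Anthropic({ apiKey });

  const msg = await client.messages.create({
    model: "claude-sonnet-4-20250514",
    max_tokens: 512,
    system:
      'You are a security triage assistant. Reply with JSON: {"severity": "low" | "medium" | "high", "rationale": string}.',
    messages: [
      {
        role: "user",
        content: `Rule: ${finding.rule}\nFile: ${finding.file}\nSnippet:\n${finding.snippet}\nIs this exploitable in context?`,
      },
    ],
  });

  const block = msg.content[0];
  return block.type === "text" ? block.text : "";
}
```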

Tier 2: Cloudflare Workers AI

Role: Edge inference for private/fast analysis

Used for:

  • Pattern classification (is this string a secret or a false positive?)
  • Quick severity triage before escalating to Tier 1
  • Embedding generation for Vectorize (semantic code search)
  • Summary generation for reports

Why Workers AI: Runs in the same Cloudflare network as the rest of the stack. No data leaves Cloudflare's infrastructure — important for customers with data residency requirements.

Models: @cf/meta/llama-3.1-8b-instruct (general), @cf/baai/bge-base-en-v1.5 (embeddings)
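
A minimal sketch of how Tier 2 might be invoked from a Worker follows, assuming a Workers AI binding named `AI`; the triage prompt and response handling are illustrative.

```typescript
// Assumes a Workers AI binding named AI (wrangler.toml: [ai] binding = "AI").
export interface Env {
  AI: Ai;
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const snippet = await request.text();

    // Quick triage with the general model before deciding whether to escalate to Tier 1.
    const triage = await env.AI.run("@cf/meta/llama-3.1-8b-instruct", {
      messages: [
        { role: "system", content: "Classify: likely secret, likely false positive, or unsure." },
        { role: "user", content: snippet },
      ],
    });

    // Embedding for Vectorize-backed semantic code search.
    const embedding = await env.AI.run("@cf/baai/bge-base-en-v1.5", {
      text: [snippet],
    });

    return Response.json({ triage, dimensions: embedding.data[0].length });
  },
};
```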

Tier 3: OpenAI / Gemini (Fallback)

Role: Redundancy and specialized capabilities

Used for:

  • Fallback when Claude API is unavailable
  • Specific tasks where GPT-4 or Gemini has an edge (rare)
  • A/B testing model quality for specific rule types

Why fallback matters: A security scanner that goes down when an API provider has an outage is not a security scanner. The fallback chain ensures findings are always produced.


AI Gateway

All LLM traffic flows through Cloudflare AI Gateway (routing sketch at the end of this section), which provides:

Logging

  • Every prompt and response is logged (with customer data redacted)
  • Token usage tracked per scan, per agent, per model
  • Latency metrics for model performance monitoring

DLP (Data Loss Prevention)

  • Customer source code is never sent verbatim in prompts to Tier 3 providers
  • Code snippets are abstracted to patterns before LLM analysis (see the sketch after this list)
  • PII detection on outbound prompts
  • Workers AI (Tier 2) is exempt — data stays within Cloudflare
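
The abstraction step is not specified here; purely as an illustration, it could look something like the sketch below, where the function name and regexes are assumptions rather than the actual DLP rules.

```typescript
// Hypothetical snippet-abstraction pass applied before Tier 3 prompts.
// The regexes below are assumptions for illustration, not the real DLP rules.
export function abstractSnippet(code: string): string {
  return code
    // Blank out string-literal contents so secrets never leave in plaintext.
    .replace(/(["'`])(?:\\.|(?!\1).)*\1/g, '"<STRING>"')
    // Mask long hex/base64-looking tokens that could be credentials.
    .replace(/\b[A-Za-z0-9+/_-]{24,}\b/g, "<TOKEN>")
    // Collapse long numeric literals.
    .replace(/\b\d{4,}\b/g, "<NUM>");
}

// abstractSnippet('const key = "sk-live-abc123";')
//   => 'const key = "<STRING>";'
```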

Rate Limiting

  • Per-customer rate limits prevent abuse
  • Per-model rate limits respect provider quotas
  • Burst allowance for large scans with graceful degradation

Cost Tracking

  • Per-scan cost attribution
  • Per-agent cost breakdown
  • Monthly budget alerts
  • Automatic model downgrade if budget exceeded (Claude → Workers AI)
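
As a sketch of how traffic is routed through the gateway, a Tier 1 client can be pointed at the AI Gateway provider endpoint instead of the provider's own host; the account and gateway identifiers below are placeholders.

```typescript
import Anthropic from "@anthropic-ai/sdk";

// Route Tier 1 calls through AI Gateway so the logging, rate-limit, and cost
// controls above apply. ACCOUNT_ID and GATEWAY_ID are placeholders.
function gatewayClient(apiKey: string): Anthropic {
  return new Anthropic({
    apiKey,
    baseURL: "https://gateway.ai.cloudflare.com/v1/ACCOUNT_ID/GATEWAY_ID/anthropic",
  });
}
```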

Fallback Chain

```
Claude API (Tier 1)
    │ fails/timeout
    ▼
Workers AI (Tier 2)
    │ fails/timeout
    ▼
OpenAI API (Tier 3a)
    │ fails/timeout
    ▼
Gemini API (Tier 3b)
    │ fails/timeout
    ▼
Rule-only mode (no LLM, SAST rules still run)
```

The final fallback is critical: even if every LLM provider is down simultaneously, the 14 SAST rules and dependency checks still run. The scanner degrades gracefully — the LLM layer adds context, not core functionality.
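
In code, the chain could be expressed roughly as below; the per-tier helpers and the 15-second timeout are assumptions for illustration, not the scanner's actual values.

```typescript
type Analysis = { severity: string; rationale: string };

// Each tier is a function that either returns an analysis or throws.
type TierFn = (snippet: string) => Promise<Analysis>;

export async function analyzeWithFallback(
  snippet: string,
  tiers: TierFn[], // e.g. [claude, workersAI, openai, gemini] in priority order
  timeoutMs = 15_000,
): Promise<Analysis | null> {
  for (const tier of tiers) {
    try {
      // Race each tier against a timeout so a hung provider can't stall the scan.
      return await Promise.race([
        tier(snippet),
        new Promise<never>((_, reject) =>
          setTimeout(() => reject(new Error("timeout")), timeoutMs),
        ),
      ]);
    } catch {
      // fails/timeout: fall through to the next tier
    }
  }
  // Rule-only mode: return null and let the SAST rules stand on their own.
  return null;
}
```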


Why the Claw Family Was Rejected Entirely

| Factor | Self-Hosted (Claw) | API-Based (Last Mile) |
| --- | --- | --- |
| Infrastructure | GPU servers, CUDA, drivers | Zero servers |
| Cold start | Minutes (model loading) | Milliseconds (API call) |
| Scaling | Manual capacity planning | Automatic |
| Cost model | Fixed (idle servers cost money) | Per-token (idle = $0) |
| Security | Server to patch and harden | No attack surface |
| Redundancy | Single point of failure | 4-tier fallback chain |
| Updates | Manual model weight downloads | Provider handles updates |
| Compliance | Data on your GPUs | Data on Cloudflare/provider infrastructure |

The conclusion: for a security product that scans code, self-hosted inference is a liability, not an asset.
