# Inference Strategy
Last Mile 360 does not run any LLM inference infrastructure. No GPUs, no CUDA, no model weight downloads, no VRAM management. This is a deliberate decision, not a limitation.
The reasoning:
- **Security products should minimize attack surface.** A GPU server running inference is a server that needs patching, monitoring, and hardening. It's an origin server: the exact thing our architecture eliminates.
- **Operational burden is not our value proposition.** Users want security findings, not a side quest managing OOM kills on a 4090.
- **API-based inference is better for our use case.** Code analysis prompts are well suited to API calls: structured input, structured output, no streaming requirements, and cacheable responses.
- **Cost is predictable.** API pricing is per-token; self-hosted inference has fixed costs whether or not anyone is scanning.
- **The Claw family was evaluated and rejected entirely.** See Source Repo Analysis for the full evaluation of OpenClaw, NanoClaw, PicoClaw, ZeroClaw, and MimiClaw.
## Tier 1: Claude API (Anthropic)

Role: Complex analysis requiring deep reasoning
Used for:
- Contextual vulnerability assessment (is this `innerHTML` actually exploitable?)
- Cross-file dependency analysis
- Fix suggestion generation with framework awareness
- Severity calibration (is this a real risk or a false positive?)
Why Claude: Best-in-class for code understanding, instruction following, and structured output. Low hallucination rate on security-relevant judgments.
Model: claude-sonnet-4-20250514 (balances quality and cost)
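A Tier 1 call might be assembled as follows. This is an illustrative sketch, not the product's actual code: the payload field names follow the public Anthropic Messages API, but the system prompt, `max_tokens` value, and helper name are assumptions.

```typescript
// Illustrative request builder for a Tier 1 (Claude) analysis call.
// The system prompt and helper name are assumptions, not real product prompts.
interface ClaudeRequest {
  model: string;
  max_tokens: number;
  system: string;
  messages: Array<{ role: "user"; content: string }>;
}

function buildTier1Request(findingContext: string): ClaudeRequest {
  return {
    model: "claude-sonnet-4-20250514",
    max_tokens: 1024,
    system:
      "You are a security analyst. Judge exploitability and reply with structured JSON.",
    messages: [{ role: "user", content: findingContext }],
  };
}
// The request body would be POSTed to the Messages API endpoint
// (with x-api-key and anthropic-version headers), routed through AI Gateway.
```

Keeping the request construction as a pure function makes the prompt shape easy to test and log independently of the network call.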
## Tier 2: Cloudflare Workers AI

Role: Edge inference for private/fast analysis
Used for:
- Pattern classification (is this string a secret or a false positive?)
- Quick severity triage before escalating to Tier 1
- Embedding generation for Vectorize (semantic code search)
- Summary generation for reports
Why Workers AI: Runs in the same Cloudflare network as the rest of the stack. No data leaves Cloudflare's infrastructure — important for customers with data residency requirements.
Models: @cf/meta/llama-3.1-8b-instruct (general), @cf/baai/bge-base-en-v1.5 (embeddings)
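Inside a Worker, Tier 2 calls go through the `env.AI.run` binding. In the sketch below that binding is real Workers AI API surface, but the handler, classification prompt, and the 100-item embedding batch size are illustrative assumptions.

```typescript
// Sketch of Tier 2 calls inside a Worker. `env.AI.run` is the Workers AI
// binding; the prompt and batch size here are assumptions for illustration.
interface Env {
  AI: { run(model: string, input: unknown): Promise<unknown> };
}

// Pure helper: chunk code snippets into batches for the embedding model,
// which accepts input of the shape { text: string[] }.
function toEmbeddingBatches(
  snippets: string[],
  batchSize = 100,
): { text: string[] }[] {
  const batches: { text: string[] }[] = [];
  for (let i = 0; i < snippets.length; i += batchSize) {
    batches.push({ text: snippets.slice(i, i + batchSize) });
  }
  return batches;
}

// Quick secret-vs-false-positive triage on the edge model.
async function classifySecret(env: Env, candidate: string): Promise<unknown> {
  return env.AI.run("@cf/meta/llama-3.1-8b-instruct", {
    messages: [
      {
        role: "system",
        content: "Is this string a real secret or a false positive? Answer SECRET or FP.",
      },
      { role: "user", content: candidate },
    ],
  });
}
```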
## Tier 3: OpenAI and Gemini APIs

Role: Redundancy and specialized capabilities
Used for:
- Fallback when Claude API is unavailable
- Specific tasks where GPT-4 or Gemini has an edge (rare)
- A/B testing model quality for specific rule types
Why fallback matters: A security scanner that goes down when an API provider has an outage is not a security scanner. The fallback chain ensures findings are always produced.
## AI Gateway

All LLM traffic flows through Cloudflare AI Gateway, which provides:

**Observability**
- Every prompt and response is logged (redacted for customer data)
- Token usage tracked per scan, per agent, and per model
- Latency metrics for model performance monitoring

**Data protection**
- Customer source code is never included verbatim in prompts to Tier 3 providers
- Code snippets are abstracted to patterns before LLM analysis
- PII detection on outbound prompts
- Workers AI (Tier 2) is exempt: data stays within Cloudflare

**Rate limiting**
- Per-customer rate limits prevent abuse
- Per-model rate limits respect provider quotas
- Burst allowance for large scans, with graceful degradation

**Cost controls**
- Per-scan cost attribution
- Per-agent cost breakdown
- Monthly budget alerts
- Automatic model downgrade (Claude → Workers AI) if the budget is exceeded
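The "snippets are abstracted to patterns" step could look like the following. This is a hypothetical sketch, not the actual abstraction logic: it replaces string and numeric literals with placeholders so that no verbatim customer code appears in prompts sent to Tier 3 providers.

```typescript
// Hypothetical pattern-abstraction pass for outbound Tier 3 prompts:
// literals are masked so only code structure leaves the prompt.
function abstractToPattern(snippet: string): string {
  return snippet
    .replace(/"(?:[^"\\]|\\.)*"/g, '"<STR>"')   // double-quoted string literals
    .replace(/'(?:[^'\\]|\\.)*'/g, "'<STR>'")   // single-quoted string literals
    .replace(/\b\d[\d_]*(\.\d+)?\b/g, "<NUM>"); // numeric literals
}

// abstractToPattern('el.innerHTML = "hi 42" + user; retry(3);')
// → 'el.innerHTML = "<STR>" + user; retry(<NUM>);'
```

A production version would also mask identifiers that match secret-like entropy patterns; the order matters here, since strings are masked before numbers so digits inside literals never survive.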
## Fallback Chain

Each tier is tried in order; on failure or timeout, the next tier takes over:

1. Claude API (Tier 1)
2. Workers AI (Tier 2)
3. OpenAI API (Tier 3a)
4. Gemini API (Tier 3b)
5. Rule-only mode (no LLM; SAST rules still run)
The final fallback is critical: even if every LLM provider is down simultaneously, the 14 SAST rules and dependency checks still run. The scanner degrades gracefully — the LLM layer adds context, not core functionality.
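The cascade can be sketched as a simple loop over providers. This is an illustrative TypeScript sketch under assumed names, not the product's implementation: the `Provider` type, tier labels, and 10-second timeout are all assumptions.

```typescript
// Illustrative sketch of the 4-tier fallback chain; types and names are assumptions.
type Provider = (prompt: string) => Promise<string>;

interface TierResult {
  tier: string;
  text: string | null; // null => rule-only mode (SAST rules still ran)
}

async function callWithTimeout(call: Provider, prompt: string, ms: number): Promise<string> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  try {
    return await Promise.race([
      call(prompt),
      new Promise<never>((_, reject) => {
        timer = setTimeout(() => reject(new Error("timeout")), ms);
      }),
    ]);
  } finally {
    clearTimeout(timer);
  }
}

async function analyzeWithFallback(
  prompt: string,
  tiers: Array<{ name: string; call: Provider }>,
  timeoutMs = 10_000,
): Promise<TierResult> {
  for (const tier of tiers) {
    try {
      return { tier: tier.name, text: await callWithTimeout(tier.call, prompt, timeoutMs) };
    } catch {
      // Provider failed or timed out: fall through to the next tier.
    }
  }
  // Every provider is down: degrade to rule-only mode.
  return { tier: "rule-only", text: null };
}
```

Returning a sentinel result instead of throwing keeps the scan pipeline on its happy path even when every provider is unavailable.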
## Self-Hosted vs. API-Based

| Factor | Self-Hosted (Claw) | API-Based (Last Mile) |
|---|---|---|
| Infrastructure | GPU servers, CUDA, drivers | Zero servers |
| Cold start | Minutes (model loading) | Milliseconds (API call) |
| Scaling | Manual capacity planning | Automatic |
| Cost model | Fixed (idle servers cost money) | Per-token (idle = $0) |
| Security | Server to patch and harden | No attack surface |
| Redundancy | Single point of failure | 4-tier fallback chain |
| Updates | Manual model weight downloads | Provider handles updates |
| Compliance | Data on your GPUs | Data on Cloudflare/provider infrastructure |
The conclusion: for a security product that scans code, self-hosted inference is a liability, not an asset.