
Inference Strategy


Why No Self-Hosted Models

Last Mile 360 does not run any LLM inference infrastructure. No GPUs, no CUDA, no model weight downloads, no VRAM management. This is a deliberate decision, not a limitation.

The reasoning:

  1. Security products should minimize attack surface. A GPU server running inference is a server that needs patching, monitoring, and hardening. It's an origin server — the exact thing our architecture eliminates.

  2. Operational burden is not our value proposition. Users want security findings, not a side quest managing OOM kills on a 4090.

  3. API-based inference is better for our use case. Code analysis prompts are well-suited to API calls: structured input, structured output, no streaming requirements, cacheable.

  4. Cost is predictable. API pricing is per-token. Self-hosted inference has fixed costs whether or not anyone is scanning.

  5. The Claw family was evaluated and rejected entirely. See Source Repo Analysis for the full evaluation of OpenClaw, NanoClaw, PicoClaw, ZeroClaw, and MimiClaw.


Three-Tier Model Strategy

Tier 1: Claude API (Anthropic)

Role: Complex analysis requiring deep reasoning

Used for:

  • Contextual vulnerability assessment (is this innerHTML actually exploitable?)
  • Cross-file dependency analysis
  • Fix suggestion generation with framework awareness
  • Severity calibration (is this a real risk or a false positive?)

Why Claude: Best-in-class for code understanding, instruction following, and structured output. Low hallucination rate on security-relevant judgments.

Model: claude-sonnet-4-20250514 (balances quality and cost)
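
As a rough illustration, a Tier 1 call might look like the sketch below, assuming the Anthropic TypeScript SDK; the finding shape, prompt, and helper name are illustrative, not the scanner's actual schema.

```typescript
import Anthropic from "@anthropic-ai/sdk";

// Hypothetical finding shape; illustrative only, not the scanner's real schema.
interface Finding {
  rule: string;
  file: string;
  snippet: string;
}

// Sketch of a Tier 1 severity-calibration call: structured input, structured
// output, a single round trip, no streaming needed.
export async function calibrateSeverity(finding: Finding, apiKey: string): Promise<string> {
  const client = new Anthropic({ apiKey });

  const msg = await client.messages.create({
    model: "claude-sonnet-4-20250514",
    max_tokens: 512,
    system:
      'You are a security triage assistant. Reply with JSON: {"severity": "low" | "medium" | "high", "rationale": string}.',
    messages: [
      {
        role: "user",
        content: `Rule: ${finding.rule}\nFile: ${finding.file}\nSnippet:\n${finding.snippet}\nIs this exploitable in context?`,
      },
    ],
  });

  const block = msg.content[0];
  return block.type === "text" ? block.text : "";
}
```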

Tier 2: Cloudflare Workers AI

Role: Edge inference for private/fast analysis

Used for:

  • Pattern classification (is this string a secret or a false positive?)
  • Quick severity triage before escalating to Tier 1
  • Embedding generation for Vectorize (semantic code search)
  • Summary generation for reports

Why Workers AI: Runs in the same Cloudflare network as the rest of the stack. No data leaves Cloudflare's infrastructure — important for customers with data residency requirements.

Models: @cf/meta/llama-3.1-8b-instruct (general), @cf/baai/bge-base-en-v1.5 (embeddings)
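
A minimal sketch of how Tier 2 might be invoked from a Worker follows, assuming a Workers AI binding named `AI`; the triage prompt and response handling are illustrative.

```typescript
// Assumes a Workers AI binding named AI (wrangler.toml: [ai] binding = "AI").
export interface Env {
  AI: Ai;
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const snippet = await request.text();

    // Quick triage with the general model before deciding whether to escalate to Tier 1.
    const triage = await env.AI.run("@cf/meta/llama-3.1-8b-instruct", {
      messages: [
        { role: "system", content: "Classify: likely secret, likely false positive, or unsure." },
        { role: "user", content: snippet },
      ],
    });

    // Embedding for Vectorize-backed semantic code search.
    const embedding = await env.AI.run("@cf/baai/bge-base-en-v1.5", {
      text: [snippet],
    });

    return Response.json({ triage, dimensions: embedding.data[0].length });
  },
};
```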

Tier 3: OpenAI / Gemini (Fallback)

Role: Redundancy and specialized capabilities

Used for:

  • Fallback when Claude API is unavailable
  • Specific tasks where GPT-4 or Gemini has an edge (rare)
  • A/B testing model quality for specific rule types

Why fallback matters: A security scanner that goes down when an API provider has an outage is not a security scanner. The fallback chain ensures findings are always produced.


AI Gateway

All LLM traffic flows through Cloudflare AI Gateway (routing sketch at the end of this section), which provides:

Logging

  • Every prompt and response is logged (with customer data redacted)
  • Token usage tracked per scan, per agent, per model
  • Latency metrics for model performance monitoring

DLP (Data Loss Prevention)

  • Customer source code is never sent verbatim in prompts to Tier 3 providers
  • Code snippets are abstracted to patterns before LLM analysis (see the sketch after this list)
  • PII detection on outbound prompts
  • Workers AI (Tier 2) is exempt — data stays within Cloudflare
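
The abstraction step is not specified here; purely as an illustration, it could look something like the sketch below, where the function name and regexes are assumptions rather than the actual DLP rules.

```typescript
// Hypothetical snippet-abstraction pass applied before Tier 3 prompts.
// The regexes below are assumptions for illustration, not the real DLP rules.
export function abstractSnippet(code: string): string {
  return code
    // Blank out string-literal contents so secrets never leave in plaintext.
    .replace(/(["'`])(?:\\.|(?!\1).)*\1/g, '"<STRING>"')
    // Mask long hex/base64-looking tokens that could be credentials.
    .replace(/\b[A-Za-z0-9+/_-]{24,}\b/g, "<TOKEN>")
    // Collapse long numeric literals.
    .replace(/\b\d{4,}\b/g, "<NUM>");
}

// abstractSnippet('const key = "sk-live-abc123";')
//   => 'const key = "<STRING>";'
```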

Rate Limiting

  • Per-customer rate limits prevent abuse
  • Per-model rate limits respect provider quotas
  • Burst allowance for large scans with graceful degradation

Cost Tracking

  • Per-scan cost attribution
  • Per-agent cost breakdown
  • Monthly budget alerts
  • Automatic model downgrade if budget exceeded (Claude → Workers AI)
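
As a sketch of how traffic is routed through the gateway, a Tier 1 client can be pointed at the AI Gateway provider endpoint instead of the provider's own host; the account and gateway identifiers below are placeholders.

```typescript
import Anthropic from "@anthropic-ai/sdk";

// Route Tier 1 calls through AI Gateway so the logging, rate-limit, and cost
// controls above apply. ACCOUNT_ID and GATEWAY_ID are placeholders.
function gatewayClient(apiKey: string): Anthropic {
  return new Anthropic({
    apiKey,
    baseURL: "https://gateway.ai.cloudflare.com/v1/ACCOUNT_ID/GATEWAY_ID/anthropic",
  });
}
```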

Fallback Chain

```
Claude API (Tier 1)
    │ fails/timeout
    ▼
Workers AI (Tier 2)
    │ fails/timeout
    ▼
OpenAI API (Tier 3a)
    │ fails/timeout
    ▼
Gemini API (Tier 3b)
    │ fails/timeout
    ▼
Rule-only mode (no LLM, SAST rules still run)
```

The final fallback is critical: even if every LLM provider is down simultaneously, the 14 SAST rules and dependency checks still run. The scanner degrades gracefully — the LLM layer adds context, not core functionality.
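
In code, the chain could be expressed roughly as below; the per-tier helpers and the 15-second timeout are assumptions for illustration, not the scanner's actual values.

```typescript
type Analysis = { severity: string; rationale: string };

// Each tier is a function that either returns an analysis or throws.
type TierFn = (snippet: string) => Promise<Analysis>;

export async function analyzeWithFallback(
  snippet: string,
  tiers: TierFn[], // e.g. [claude, workersAI, openai, gemini] in priority order
  timeoutMs = 15_000,
): Promise<Analysis | null> {
  for (const tier of tiers) {
    try {
      // Race each tier against a timeout so a hung provider can't stall the scan.
      return await Promise.race([
        tier(snippet),
        new Promise<never>((_, reject) =>
          setTimeout(() => reject(new Error("timeout")), timeoutMs),
        ),
      ]);
    } catch {
      // fails/timeout: fall through to the next tier
    }
  }
  // Rule-only mode: return null and let the SAST rules stand on their own.
  return null;
}
```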


Why the Claw Family Was Rejected Entirely

| Factor | Self-Hosted (Claw) | API-Based (Last Mile) |
| --- | --- | --- |
| Infrastructure | GPU servers, CUDA, drivers | Zero servers |
| Cold start | Minutes (model loading) | Milliseconds (API call) |
| Scaling | Manual capacity planning | Automatic |
| Cost model | Fixed (idle servers cost money) | Per-token (idle = $0) |
| Security | Server to patch and harden | No attack surface |
| Redundancy | Single point of failure | 4-tier fallback chain |
| Updates | Manual model weight downloads | Provider handles updates |
| Compliance | Data on your GPUs | Data on Cloudflare/provider infrastructure |

The conclusion: for a security product that scans code, self-hosted inference is a liability, not an asset.
