Skip to content

hkevin01/secure-llm-assistant

Repository files navigation

🔒 Secure LLM Assistant

Air-gapped, on-prem LLM platform for translating legacy Java codebases and requirements into idiomatic Python — with zero internet egress and a full audit trail.

License Stars Forks Last Commit Repo Size Issues

Python FastAPI Tests OWASP Kubernetes Air-Gapped


🚀 What Is This & Why Should You Care?

If your team is staring down a mountain of legacy Java code that needs to become Python — and you work in a classified, air-gapped, or otherwise locked-down environment where sending code to OpenAI or any cloud API is simply not an option — this is the platform built for that exact problem.

Secure LLM Assistant runs a full AI-powered code modernisation stack entirely inside your network. No packets leave. No model calls an external server. No code ever touches a vendor's API. The LLM runs on your GPU node. The audit log stays on your hardware.


🎯 The Core Job: Java → Python, Done Right

The platform's primary mission is translating legacy Java codebases into production-quality Python 3.12+. This isn't a simple find-and-replace — it's a structured, context-aware conversion pipeline that understands Java semantics:

  1. Parses your Java first using a pure-Python Java parser (javalang) — no JVM required. It extracts class names, inheritance chains, method signatures, field types, and import graphs before a single token goes to the LLM.

  2. Enriches the prompt with that metadata so the LLM understands the class structure, not just the raw text. A HashMap<String, List<Order>> becomes dict[str, list[Order]] — not dict.

  3. Applies a translation style you choose"idiomatic" for maximum Pythonic output (dataclasses, comprehensions, context managers), "typed" for strict mypy/pyright-compatible annotations, or "literal" for a line-by-line reference mapping that makes audits easy.

  4. Handles whole projects, not just single files. Submit a {filename: source} dict of your entire Java package. The platform builds a dependency graph, runs a topological sort (Kahn's algorithm), and translates base classes before subclasses — so OrderService already sees a translated Order class when it's generated. No broken imports. No undefined references.

  5. Remembers your conversation. Each translation returns a session_id. Pass it back and say "now make all methods async" or "add type annotations throughout" — the LLM sees the full prior context and refines the output iteratively, exactly the way a human pair-programmer would.

Full Java → Python type mapping reference — every primitive, collection, and pattern covered — is in the Java → Python Translation section.


🛡️ Security Is Not An Afterthought — It's The Architecture

Every request passes through five independent defence layers regardless of which endpoint is called:

Layer What It Does Why It Matters
① Input Guardrail Blocks 8 families of prompt injection patterns, detects credentials/secrets in the input, enforces size limits An attacker cannot use the translate endpoint as an LLM jailbreak vector
② JWT + RBAC RS256 asymmetric token validation against your internal IdP; role → permission mapping enforced per endpoint The orchestrator holds only the public key — it can verify tokens but cannot mint them, so a compromised orchestrator cannot impersonate users
③ Hardened System Prompt Anti-override and anti-exfiltration directives that the LLM cannot ignore via user input Prevents "ignore all previous instructions" style attacks from succeeding even if the input guardrail is somehow bypassed
④ Output Guardrail Scrubs credentials, secrets, and PII patterns from every LLM response before it reaches the caller Prevents an LLM that hallucinates a credential string from leaking it to an unprivileged caller
⑤ Immutable Audit Log Append-only JSONL log of every action — metadata only, no raw code or prompts recorded Provides a tamper-evident record for STIG compliance and incident response

Provider Lock (PROVIDER_LOCK=true) is the IT-administered configuration freeze: once set in the Kubernetes Secret, no developer, env var injection, or misconfiguration can redirect LLM traffic to a public internet endpoint. The assert_egress_url_safe() function blocks public cloud domains at the application layer; the Kubernetes NetworkPolicy drops those packets at the CNI layer. Two independent layers that must both be defeated to exfiltrate data.

PRC-origin models are permanently blocked. core/provider_lock.py rejects Qwen (Alibaba), DeepSeek, Baichuan, InternLM, ChatGLM/GLM, MiniMax, and Moonshot/Kimi at startup and on every call. No configuration option re-enables them. This is a hard-coded supply-chain control, not a policy setting.

Built-in static analysis catches dangerous Python patterns in translated output before it reaches the caller: eval/exec calls, pickle deserialisation, subprocess injection, and hardcoded credentials. The Java analyser runs the same checks on Java input to enrich translation prompts with security context.


🔧 Why These Technologies?

Every component in the stack was chosen for a specific reason — not defaults, not hype:

Technology Why It Was Chosen (Not Just What It Is)
Python 3.12 + FastAPI FastAPI's Depends() injection chain is ideal for stacking auth → guardrail → session → LLM in a single readable pipeline. Pydantic v2 validates every request at the boundary — malformed input raises before any business logic runs. The ast stdlib module is used for Python static analysis with no external dependencies.
RS256 JWT (PyJWT) Asymmetric signing means the orchestrator only needs the public key. A compromised service pod cannot mint tokens. HS256 would require distributing a shared secret to every service — a supply-chain risk in a classified environment.
pydantic-settings Fail-fast config validation at import time. If LLM_ENDPOINT is not set, the service refuses to start with a clear error instead of silently failing on the first LLM call. Catches misconfiguration before a deployment goes live.
javalang (pure Python Java parser) No JVM needed on the inference node. Parses Java source into an AST with class structure, method signatures, and import graphs — without executing any Java. Running a JVM in an air-gapped environment creates an additional attack surface and heavyweight dependency.
httpx (async HTTP) All LLM calls use httpx.AsyncClient with a per-request 120-second timeout. No global session means no shared state between requests. The assert_egress_url_safe() check runs on every constructed URL before the connection opens — not at config time.
PostgreSQL 16 + pgvector RAG context enrichment without standing up a separate vector database service. pgvector's ivfflat approximate nearest-neighbour search is fast enough for code retrieval workloads. Reuses existing Postgres infrastructure the ops team already knows, monitors, and backs up.
nomic-embed-text 8192-token context window covers entire Java class files in a single embedding — no chunking needed for most real-world classes. Apache 2.0 licence. Runs entirely self-hosted via Ollama. No calls to an external embedding API.
Ollama (dev) + vLLM (prod) Both expose an OpenAI-compatible /v1/chat/completions API, so the same llm_client.py works for both with a one-env-var switch. Ollama is one-command for a developer laptop. vLLM's PagedAttention and tensor parallelism handle the 70B+ model batching required for production throughput on a GPU cluster.
boto3 + AWS Bedrock GovCloud For teams that need FedRAMP High / IL4/IL5 authorisation without managing GPU hardware, Bedrock GovCloud provides Llama 3 and Claude 3.5 Sonnet with DISA authorization. boto3's Signature V4 is handled automatically; the BedrockGovProvider enforces us-gov-west-1 / us-gov-east-1 regions at startup and blocks all PRC-origin model IDs.
redis>=5.0 (session backend) The default in-process session store is single-replica only. Set SESSION_BACKEND=redis to switch to the Redis/Valkey-backed store for horizontal scaling across multiple orchestrator pods — no code changes, no API differences, just one env var.
Kubernetes + NetworkPolicy NetworkPolicy operates at the CNI layer — it cannot be bypassed by application code. Default-deny with an explicit allowlist means new egress paths cannot appear accidentally. Rolling deploys ensure zero-downtime updates. Namespace isolation limits blast radius if a pod is compromised.
difflib (stdlib diff translation) The /translate-diff endpoint computes unified diffs between before/after Java versions and translates only the changed hunks — not the whole file. This means a 2-line method change in a 2,000-line class sends ~40 lines to the LLM, not 2,000. Uses only Python stdlib; no additional dependency.

📋 Quick Capability Summary

What You Can Do Endpoint Notes
Translate a Java class to Python POST /api/v1/translate Choose idiomatic, typed, or literal style
Translate a whole Java project POST /api/v1/translate-project Dependency-ordered; handles inheritance graphs
Convert requirements docs to Python stubs POST /api/v1/translate-requirements One typed stub + one pytest stub per requirement
Translate only the changed lines (diff) POST /api/v1/translate-diff Submit before/after Java; only changed hunks are translated
Ask follow-up questions about translated code POST /api/v1/chat Full multi-turn session memory
Get an OWASP code review POST /api/v1/review Python and Java supported
Generate a pytest or JUnit 5 test suite POST /api/v1/generate-tests Engineers: only; contractors: blocked by RBAC
Check if a model has a US government ATO POST /api/v1/evaluate-model-ato Scores 6 supply-chain criteria; returns tier + report
Analyse algorithm complexity (Big-O) POST /api/v1/analyze-algorithm With session memory for iterative refinement

Table of Contents


🔍 Overview

The Secure LLM Assistant is an internal, air-gapped LLM orchestration platform purpose-built for engineering teams modernising legacy Java systems. Its primary mission is to translate Java source code into production-quality Python 3.12+ and convert legacy requirements documents into typed, runnable Python scaffolds — all without a single packet leaving the classified network.

Important

This system makes zero external network calls. Every LLM request, embedding operation, and authentication check routes exclusively to internal services. The K8s NetworkPolicy enforces this at the infrastructure layer — there is no outbound internet rule and none will be added.

Who is it for? Software engineering teams on classified or high-security networks who need to modernise a Java codebase to Python but cannot use cloud LLM APIs. Engineers submit code through the VS Code extension or web app; the platform returns idiomatic, type-annotated Python with an immutable, metadata-only audit record.

What problem does it solve? Large Java codebases (100k–2M LOC) take years to rewrite manually. This platform accelerates that work with an LLM that runs on your GPU node, understands Java class structure via a pure-Python parser, and enforces consistent idiomatic output through explicit, verifiable translation rules — with no IP leaving the building.

(back to top ↑)


✨ Key Features

Icon Feature Description Impact Status
🔄 Java → Python Translation Converts legacy Java to idiomatic Python 3.12+ with type hints, dataclasses, and Pythonic patterns Primary ✅ Stable
�️ Multi-file Project Translation Dependency graph + topological sort translates whole Java projects in base-class-first order Primary ✅ Stable
📋 Requirements → Python Scaffold Translates legacy requirements docs into typed function stubs + pytest stubs Primary ✅ Stable
💬 Multi-turn Session Memory Per-user conversation history with TTL, sliding-window budget, and auto session IDs High ✅ Stable
🧠 RAG Context Enrichment pgvector semantic search injects internal code context into every translation prompt High ✅ Stable
🛡️ Prompt Injection Defence Regex guardrail blocks 8 injection pattern families before any LLM call Critical ✅ Stable
🔑 JWT/OIDC + RBAC RS256 token validation against internal IdP; role→permission mapping Critical ✅ Stable
🔍 Python AST Security Scan Detects eval/exec/pickle/subprocess injection and hardcoded credentials High ✅ Stable
Java Structural Analysis Pure-Python Java parser (no JVM); enriches translation prompts with class metadata High ✅ Stable
📝 Immutable Audit Trail Append-only JSONL log; metadata only — no raw code or prompts ever recorded Critical ✅ Stable
🔒 Output Redaction Strips secrets/credentials from every LLM response before reaching the caller Critical ✅ Stable
🔌 Pluggable LLM Backend + Provider Lock Switch between on-prem (Ollama/vLLM) and Azure OpenAI Government via env vars; PROVIDER_LOCK=true freezes the config and blocks all PRC-origin models + public egress at startup Critical ✅ Stable
�🚀 One-command Deploy Docker Compose for dev; K8s manifests for production High ✅ Stable

Three translation styles are available on every /translate call:

  • "idiomatic" — maximum Python idioms: dataclasses, properties, list comprehensions, context managers
  • "typed" — strict type annotations: TypeVar, Protocol, Generic[T], fully annotated variables
  • "literal" — structure-preserving reference mapping for engineers who need a line-by-line audit

(back to top ↑)


🏗️ Architecture

flowchart TD
    subgraph CLIENTS["Client Layer"]
        VSC[VS Code Extension]
        WEB[Web App]
        IDE[JetBrains Plugin]
    end

    subgraph GATEWAY["API Gateway (internal)"]
        GW[OIDC Auth · RBAC · Rate Limit]
    end

    subgraph ORCHESTRATOR["Orchestrator — FastAPI"]
        IG[Input Guardrail\ninjection · secrets · length]
        SS[Session Store\nper-user TTL history]
        TRANS[Translation Tools\nJava→Python · Multi-file · Reqs→Python]
        ANAL[Static Analysis\nAST · Java Parser]
        LLM_C[LLM Client\ntemp=0 · system prompt · history injection]
        OG[Output Guardrail\ncredential redaction]
        AUDIT[Audit Logger\nappend-only JSONL]
    end

    subgraph BACKEND["Backend Services (internal only)"]
        LLM[LLM Inference\nvLLM / Ollama]
        VDB[(pgvector\nRAG Index)]
        IDP[Internal IdP\nOIDC / LDAP]
    end

    CLIENTS --> GW --> IG
    IG -->|clean| TRANS
    IG -->|clean| ANAL
    SS -->|history| LLM_C
    TRANS --> LLM_C
    ANAL --> LLM_C
    LLM_C --> LLM
    LLM --> OG --> AUDIT
    LLM_C -.->|RAG retrieval| VDB
    GW -.->|JWKS| IDP

    style CLIENTS fill:#1e3a5f,color:#fff
    style ORCHESTRATOR fill:#0d3b2e,color:#fff
    style BACKEND fill:#3b1e1e,color:#fff
Loading

Request Data Flow

sequenceDiagram
    participant Dev as Developer
    participant GW as API Gateway
    participant IG as Input Guard
    participant SS as Session Store
    participant JA as Java Analyzer
    participant RAG as pgvector RAG
    participant LLM as LLM Inference
    participant OG as Output Guard
    participant AL as Audit Log

    Dev->>GW: POST /api/v1/translate {code, style, session_id}
    GW->>GW: Verify RS256 JWT · Check translate permission
    GW->>IG: sanitize(code)
    IG-->>GW: InputGuardError → HTTP 400 (if blocked)
    IG->>JA: parse_java_class(code)
    JA-->>IG: JavaClassInfo {name, methods, fields, complexity}
    IG->>RAG: retrieve(rag_query) via asyncpg → pgvector
    RAG-->>IG: top-k relevant internal code snippets
    IG->>SS: get_history(user_sub, session_id)
    SS-->>IG: trimmed prior turns (≤3000 chars)
    IG->>LLM: POST /chat/completions {system + history + rag_context + code}
    LLM-->>OG: raw Python translation
    OG->>OG: redact credential patterns
    OG->>SS: append_exchange(user_sub, session_id, prompt, response)
    OG->>AL: audit.log(user_id, "translate", tool, blocked=false)
    OG-->>Dev: {"python": "...", "java_metadata": {...}, "session_id": "..."}
Loading

Component Responsibilities

Component File Responsibility
API Router src/api/routes.py 9 endpoints; guardrail + session + audit enforcement on all
Auth src/core/auth.py RS256 JWT validation; RBAC permission check
Session Store src/core/session_store.py In-process per-user conversation history; TTL=3600s; sliding-window budget
LLM Client src/core/llm_client.py Internal-only httpx; temp=0; hardened system prompt; history injection
Input Guard src/guardrails/input_guard.py Injection · credential · length defence
Output Guard src/guardrails/output_guard.py Credential redaction from LLM responses
Translation Tools src/tools/translation_tools.py Java→Python · Requirements→Python prompt builders; 3 style directives
Project Translator src/tools/project_translator.py Dependency graph + Kahn topological sort for multi-file projects
Java Analyzer src/tools/java_analyzer.py Pure-Python Java parser; no JVM required
Python Analyzer src/tools/python_analyzer.py AST-based SAST; complexity and security scanning
RAG Retriever src/rag/retriever.py Embed query → asyncpg → pgvector top-k → formatted context string
RAG Indexer services/rag-indexer-python/src/indexer.py pgvector chunked embedding indexer

(back to top ↑)


🧩 Object Model

classDiagram
    class TranslationRequest {
        +str code
        +str text
        +str style
        +str prompt
        +str session_id
    }
    class ProjectTranslationRequest {
        +dict~str,str~ files
        +str style
        +str prompt
        +str session_id
    }
    class CodeRequest {
        +str code
        +str language
        +str question
        +str session_id
    }
    class ChatRequest {
        +str message
        +str session_id
    }
    class JavaClassInfo {
        +str name
        +str package
        +list~str~ methods
        +list~str~ fields
        +str extends
        +list~str~ implements
        +bool is_interface
        +bool is_abstract
        +list~str~ imports
    }
    class PythonModuleInfo {
        +str module
        +list~str~ functions
        +list~str~ classes
        +int max_loop_depth
    }
    class ProjectTranslationPlan {
        +list~ProjectFileEntry~ ordered_files
        +bool had_cycle
    }
    class ProjectFileEntry {
        +str filename
        +JavaClassInfo class_info
        +list~str~ dependencies
    }
    class SessionStore {
        +dict _store
        +new_session_id() str
        +get_history(user_sub, session_id) list
        +append_exchange(user_sub, session_id, user_msg, assistant_msg)
        +clear_session(user_sub, session_id)
        +purge_expired() int
    }
    class AuditRecord {
        +str ts
        +str user_id
        +str action
        +str tool
        +str model_version
        +list data_sources
        +bool blocked
    }
    TranslationRequest --> JavaClassInfo : parsed by java_analyzer
    ProjectTranslationRequest --> ProjectTranslationPlan : planned by project_translator
    ProjectTranslationPlan "1" --> "*" ProjectFileEntry : ordered_files
    ProjectFileEntry --> JavaClassInfo : class_info
    CodeRequest --> PythonModuleInfo : analyzed by python_analyzer
    ChatRequest --> SessionStore : session_id lookup
    TranslationRequest --> AuditRecord : logged as
    CodeRequest --> AuditRecord : logged as
    ChatRequest --> AuditRecord : logged as
Loading

(back to top ↑)


🔄 Primary Capabilities

☕ Java → Python Translation

Submit any Java class or file — the platform pre-parses its structure with java_analyzer, enriches the prompt with class metadata, and returns idiomatic Python 3.12+.

📐 Full Java → Python Type Mapping Reference
Java Python Notes
int / long / short int Python int is arbitrary precision
float / double float
boolean bool
char str Single character string
String str
void return -> None
byte[] bytes
Object Any From typing
ArrayList<T> list[T]
LinkedList<T> collections.deque[T]
HashMap<K,V> dict[K, V]
LinkedHashMap<K,V> dict[K, V] Insertion-ordered since Python 3.7
HashSet<T> set[T]
Optional<T> `T </sub> None`
T[] list[T]
abstract class class Foo(ABC) from abc import ABC, abstractmethod
interface class Foo(Protocol) from typing import Protocol
enum class Foo(enum.Enum)
getter/setter pair @property + setter
static final field Module-level CONSTANT
Builder pattern @dataclass or kwargs
try-with-resources with statement Context manager
switch/case match/case Python 3.10+
String.format(...) f-string
System.out.println(...) print() / logging.info()
instanceof isinstance()
null None

Tip

Use "style": "idiomatic" for production Python. Use "style": "typed" when strict mypy / pyright compliance is required. Use "style": "literal" when you need a side-by-side Java reference for manual review.

curl -X POST http://localhost:8000/api/v1/translate \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "code": "public class OrderService { private List<Order> pending = new ArrayList<>(); public void enqueue(Order o) { pending.add(o); } }",
    "style": "idiomatic",
    "session_id": ""
  }'

Response includes session_id — pass it back in subsequent requests to continue the conversation:

# Follow-up: "now make all methods async"
curl -X POST http://localhost:8000/api/v1/translate \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"prompt": "now make all methods async", "style": "idiomatic", "session_id": "a1b2c3d4-..."}'

Press Ctrl+Shift+T in the VS Code extension to open the translate panel directly.

🗂️ Multi-file Project Translation

Submit a whole Java project as a {filename: source} dict. The platform:

  1. Parses every file with java_analyzer to extract class names, extends, and implements relationships
  2. Builds a directed dependency graph — OrderService → Order, PaymentProcessor → Order, Customer, etc.
  3. Sorts using Kahn's algorithm (topological sort) so base classes are translated before their subclasses
  4. Retrieves one shared RAG context block from pgvector for the whole project
  5. Translates each file in order, injecting already-translated dependency classes into each subsequent prompt so naming, types, and import patterns are consistent across the entire output package

Note

Circular dependencies are handled gracefully — a had_cycle: true flag is returned and files in a cycle are translated in an arbitrary safe order. Human review is recommended for cyclic class graphs.

curl -X POST http://localhost:8000/api/v1/translate-project \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "files": {
      "Order.java": "public class Order { private String id; private double amount; }",
      "OrderService.java": "public class OrderService { private List<Order> pending = new ArrayList<>(); public void enqueue(Order o) { pending.add(o); } }"
    },
    "style": "idiomatic",
    "prompt": "use FastAPI patterns where applicable"
  }'

Response:

{
  "files": {
    "Order.java": "@dataclass\nclass Order:\n    id: str\n    amount: float",
    "OrderService.java": "@dataclass\nclass OrderService:\n    _pending: list[Order] = field(default_factory=list)\n    def enqueue(self, o: Order) -> None: ..."
  },
  "dependency_order": ["Order.java", "OrderService.java"],
  "rag_context_used": true,
  "had_cycle": false,
  "session_id": ""
}

📋 Requirements → Python Scaffold

Submit a legacy requirements document (plain text, exported Word doc converted to text) and receive a runnable Python module with:

  • Typed function stubs — one per identifiable requirement
  • Google-style docstrings that quote the original requirement text verbatim
  • pytest stubs in the same file that engineers extend rather than write from scratch

Note

Requirements documents with explicit REQ-NNN: numbering produce the most precise scaffolds. Free-form prose is also supported — the LLM extracts implicit requirements from natural language.

curl -X POST http://localhost:8000/api/v1/translate-requirements \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"text": "REQ-001: The system shall validate user email addresses before account creation.\nREQ-002: The system shall hash passwords with bcrypt cost factor >= 12."}'

🧠 Multi-turn Session Memory

Every data-bearing endpoint accepts an optional session_id field. When provided, the platform:

  1. Loads the prior conversation turns for {user_sub}:{session_id} from the in-process session store
  2. Trims history to a MAX_INJECTED_CHARS=3000 budget (newest turns kept) so token limits are respected
  3. Injects history between the system prompt and the current user message — the standard pattern for multi-turn LLM conversations
  4. Saves the new exchange, pruning at MAX_HISTORY_TURNS=20 stored rounds
  5. Returns the session_id in the response so the client can pass it back next turn

This enables true iterative refinement without re-submitting context:

Turn 1: translate OrderService.java          → session_id: "a1b2-..."
Turn 2: "now make all methods async"         → session_id: "a1b2-..." (LLM sees Turn 1)
Turn 3: "add type annotations throughout"    → session_id: "a1b2-..." (LLM sees Turns 1+2)

The /chat endpoint is the pure conversational interface — no code analysis enrichment, just history-aware free-form Q&A:

# Start a new session
curl -X POST http://localhost:8000/api/v1/chat \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"message": "What Python equivalent should I use for Java Optionals?"}'
# Response: {"response": "...", "session_id": "a1b2c3d4-..."}

# Continue the same session
curl -X POST http://localhost:8000/api/v1/chat \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"message": "Can you show me an example with a database result?", "session_id": "a1b2c3d4-..."}'

Session lifecycle:

  • Sessions expire after 60 minutes of inactivity (TTL=3600s)
  • A background task runs every 10 minutes to purge expired sessions
  • Sessions are scoped by user_sub — session IDs cannot be hijacked across users
  • Delete a session explicitly: DELETE /api/v1/session/{session_id}
stateDiagram-v2
    [*] --> Active : first turn (auto session_id generated)
    Active --> Active : subsequent turns (session_id provided)
    Active --> Expired : no activity for 60 min
    Active --> Cleared : DELETE /session/{id}
    Expired --> [*] : purge_expired() background sweep
    Cleared --> [*]
Loading
📋 All API Endpoints
Method Endpoint Permission Session Memory Description
POST /api/v1/translate translate Java source → idiomatic Python 3.12+ (RAG-enriched)
POST /api/v1/translate-project translate Multi-file Java project → Python package (dependency-ordered, RAG-enriched)
POST /api/v1/translate-requirements translate Requirements doc → typed Python scaffold (RAG-enriched)
POST /api/v1/assist code_assist General code assistance (Python & Java)
POST /api/v1/review review OWASP-informed code review
POST /api/v1/analyze-algorithm code_assist Big-O complexity analysis
POST /api/v1/generate-tests test_gen pytest / JUnit 5 test scaffold generation
POST /api/v1/chat code_assist Pure multi-turn conversational endpoint
DELETE /api/v1/session/{session_id} code_assist Clear conversation history for a session
GET /api/v1/health none Kubernetes liveness / readiness probe

All data-bearing endpoints follow the same pipeline:
sanitize()require_permission()get_history()call_llm(history=...)validate_output()append_exchange()audit.log()

(back to top ↑)


🤖 Agent Architecture

The platform is intentionally split into specialized agents so each stage of the workflow is isolated, testable, and auditable.

Agent Primary Responsibility Inputs Outputs Where It Lives
Orchestrator Agent Coordinates end-to-end request flow and policy enforcement API request, user identity, config Final API response, audit event services/orchestrator-python/src/main.py, services/orchestrator-python/src/api/routes.py
Auth/RBAC Agent Verifies JWT and enforces per-endpoint permissions Bearer token, required permission Authorized user context or 401/403 services/orchestrator-python/src/core/auth.py
Input Guardrail Agent Blocks prompt injection, secret leakage, and oversized payloads before model access Raw code/prompt text Sanitized text or 400 services/orchestrator-python/src/guardrails/input_guard.py
Java Analysis Agent Parses Java structure for class metadata and dependency extraction Java source code JavaClassInfo metadata services/orchestrator-python/src/tools/java_analyzer.py
Project Translation Planner Agent Builds dependency graph and topological order for multi-file translation {filename: java_source} project map Ordered translation plan, cycle detection services/orchestrator-python/src/tools/project_translator.py
RAG Retrieval Agent Retrieves relevant internal context from pgvector Query built from class metadata and prompt Ranked context snippets services/orchestrator-python/src/rag/retriever.py
Prompt Construction Agent Produces deterministic translation/review/test prompts with fixed rules Sanitized input, metadata, style, RAG context LLM-ready prompt services/orchestrator-python/src/tools/translation_tools.py
LLM Provider Agent Sends requests to approved model backends and enforces egress/provider lock policy Prompt + chat history + provider settings Raw model response services/orchestrator-python/src/core/llm_client.py, services/orchestrator-python/src/core/provider_lock.py
Output Guardrail Agent Redacts secrets and policy-violating output patterns Raw model response Safe response payload services/orchestrator-python/src/guardrails/output_guard.py
Session Memory Agent Maintains per-user conversational state and TTL cleanup user_sub, session_id, messages Retrieved/appended conversation history services/orchestrator-python/src/core/session_store.py, services/orchestrator-python/src/core/session_backend.py
Audit Agent Writes immutable metadata-only event records for traceability Action metadata, user id, tool/model id, blocked flag JSONL audit event services/orchestrator-python/src/core/logging.py
Client Agents (VS Code/Web) Collect user requests and render responses for engineers User actions and API payloads Structured calls to orchestrator APIs frontend/vscode-extension/, frontend/web-app/

Agent Execution Order

  1. Client Agent sends request to API.
  2. Orchestrator Agent receives and routes by endpoint.
  3. Auth/RBAC Agent validates identity and permission.
  4. Input Guardrail Agent sanitizes or blocks request.
  5. Analysis/Planner Agents parse Java and compute project order (when applicable).
  6. RAG Retrieval Agent fetches relevant internal context (when enabled).
  7. Prompt Construction Agent assembles model-ready instructions.
  8. LLM Provider Agent calls approved internal/government model backend.
  9. Output Guardrail Agent redacts sensitive output.
  10. Session Memory Agent stores the conversation turn.
  11. Audit Agent records immutable metadata.
  12. Orchestrator Agent returns the final response to the client.

This layered agent model is why the platform can answer the core assurance questions for every request: did it do what it should do, did it avoid what it must not do, and did it remain safe under adverse input conditions.

(back to top ↑)


🛡️ Security Model

Every request passes through five independent defence layers before any LLM output reaches the caller.

flowchart LR
    INPUT["User Input"] --> L1
    L1["① Input Guardrail\ninjection · secrets · size"] -->|blocked| ERR1["HTTP 400"]
    L1 -->|clean| L2["② JWT Auth + RBAC\nRS256 · role→permission"]
    L2 -->|denied| ERR2["HTTP 401/403"]
    L2 -->|authorized| L3["③ Hardened System Prompt\nanti-override · anti-exfil"]
    L3 --> L4["④ Output Guardrail\ncredential redaction"]
    L4 --> L5["⑤ Audit Log\nappend-only JSONL"]
    L5 --> RESP["Response to caller"]
Loading

Caution

The ENABLE_GUARDRAILS=false environment variable must never be set in production. It exists solely for isolated unit-test environments where the LLM server is mocked. Setting it on a live system removes layers ① and ④ entirely.

RBAC Permission Matrix

Permission engineer contractor admin
translate
code_assist
docs
review
test_gen
refactor
admin

Network Isolation

The K8s NetworkPolicy in infra/k8s/network-policy.yaml implements default-deny with an explicit allowlist:

Direction Allowed Peer Port Purpose
Ingress Internal API gateway 8000 API traffic
Egress LLM inference service 8080 LLM calls
Egress pgvector (PostgreSQL) 5432 RAG index
Egress Internal IdP 443 JWKS / token validation
Any Internet any ❌ Blocked — no rule exists

(back to top ↑)


📊 Capability Distribution

pie title API Capability Surface
    "Translation (Java→Python, Reqs→Python)" : 40
    "Security (guardrails, auth, audit)" : 30
    "Static Analysis (AST, Big-O)" : 15
    "RAG & Knowledge Retrieval" : 10
    "Infrastructure & Health" : 5
Loading

40% of the platform's surface area is dedicated to translation — reflecting its primary mission. The remaining 60% is security infrastructure that protects every translation request and secondary capability.

(back to top ↑)


� Pluggable LLM Backend

The orchestrator is designed around a provider-neutral LLM client (core/llm_client.py). The same binary can talk to a self-hosted GPU cluster, a DISA-authorized government cloud, or an on-premises disconnected appliance — determined entirely by environment variables. No code changes. No recompile. No redeployment.

┌──────────────────────────────────────────────────────────────────┐
│                  Secure LLM Orchestrator                          │
│                                                                    │
│  routes.py → llm_client.py → [ Provider Switch ]                  │
│                                      │                             │
│             ┌────────────────────────┼────────────────────────┐   │
│             ▼                        ▼                        ▼   │
│   LLM_PROVIDER=ollama    LLM_PROVIDER=vllm    LLM_PROVIDER=azure  │
│   (dev, unclassified)    (prod, on-prem)      (gov cloud, all IL) │
│   Ollama on GPU node     vLLM on GPU cluster  Azure OpenAI Gov    │
└──────────────────────────────────────────────────────────────────┘

Which LLMs Are Secure?

The table below covers every LLM option tested or integrated with this project, their US government authorization status, and whether they require cloud connectivity.

What is an ATO?

An Authority to Operate (ATO) is the formal written authorization from a senior government official (the Authorizing Official, or AO) that permits an IT system to operate within a specific environment at a specific classification level. Without an ATO, a system — or a component like an LLM — cannot legally process government data in that environment, regardless of technical security controls. An ATO is not a product certification; it is tied to a specific system, environment, and risk acceptance decision. ATOs can be revoked at any time if new risks are discovered.

For AI/ML systems, the ATO process evaluates the model's supply chain (country of origin, training data provenance, update mechanisms), its behavior under adversarial inputs, data handling practices, and alignment with the owning agency's security requirements. A model with no US government ATO is not prohibited from use on unclassified systems, but it may not process Controlled Unclassified Information (CUI), Classified National Security Information (CNSI), or any data above the authorized classification level.

What is the RMF?

The Risk Management Framework (RMF) (NIST SP 800-37) is the six-step process the US federal government uses to authorize IT systems:

Step Name What Happens
1 Categorize Classify the system's information (Confidentiality / Integrity / Availability impact levels per FIPS 199)
2 Select Choose security controls from NIST SP 800-53 appropriate to the impact level
3 Implement Apply the selected controls to the system
4 Assess An independent assessor (3PAO for FedRAMP, or a DoD SCA for IL systems) verifies the controls work
5 Authorize The Authorizing Official reviews residual risk and signs the ATO (or issues a Denial of ATO)
6 Monitor Continuous monitoring — vulnerabilities, configuration drift, and new threats are tracked; ATO is re-evaluated annually or on significant change

"RMF-processable" in the table below means the model is US-origin, has no known supply-chain disqualifiers, and can be submitted through an agency's RMF process to earn an ATO — but no blanket government-wide ATO has been issued yet. The system owner must still complete steps 1–5 for their specific environment.

Warning

Qwen, DeepSeek, and other PRC-origin models are permanently blocked at the runtime level. core/provider_lock.py rejects these models at startup and on every call — no configuration can re-enable them. The blocked-pattern list covers: Qwen (Alibaba), DeepSeek, Baichuan, InternLM, ChatGLM/GLM, MiniMax, Moonshot/Kimi. None of these models hold a US government ATO at any classification level.

Model Country of Origin Cloud Required? Formal US Gov Authorization Set LLM_MODEL to…
Llama 3.3 70B Instruct 🇺🇸 USA (Meta) ❌ On-prem ⚠️ No blanket ATO; US-origin; RMF-processable llama3.3:70b
Llama 3.1 405B 🇺🇸 USA (Meta) ❌ On-prem ⚠️ No blanket ATO; highest open-weight capability llama3.1:405b
Mistral Large / Codestral 🇫🇷 France (EU) ❌ On-prem ⚠️ No US ATO; non-PRC; EU data jurisdiction mistral-large
Defense Llama 🇺🇸 USA (Scale AI + Meta) ❌ Gov-controlled env ✅ Deployed in classified gov environments (Nov 2024) via Scale Donovan Contact Scale AI
Claude 3.5 Sonnet v1 / Haiku 🇺🇸 USA (Anthropic) ✅ AWS GovCloud ✅ FedRAMP High + DoD IL4/IL5 — AWS Bedrock GovCloud (May 2025) via Bedrock adapter
Azure OpenAI GPT-4o 🇺🇸 USA (Microsoft/OpenAI) ✅ Azure Gov cloud All levels: FedRAMP High (Aug 2024) · IL4/IL5 (Sep 2024) · IL6 Secret (Feb 2025) · ICD-503 Top Secret (Jan 2025) (set LLM_PROVIDER=azure)
Azure OpenAI GPT-4o-mini 🇺🇸 USA (Microsoft/OpenAI) ✅ Azure Gov cloud ✅ Same authorization as GPT-4o; lower cost (set LLM_PROVIDER=azure)
Azure Local + Foundry Local 🇺🇸 USA (Microsoft) ❌ Fully on-prem ✅ Runs large models fully disconnected — no cloud needed (Feb 2026) Contact Microsoft Federal

Legend: ✅ Authorized  |  ⚠️ US-origin, no formal ATO, RMF-processable  |  ❌ Not approved or cloud-unavoidable


Cloud vs On-Prem

The table below tells you at a glance whether a backend requires cloud connectivity, who controls the weights, and what hardware you need.

Backend Where Does It Run? Who Controls Weights? Min VRAM Connectivity Needed
Ollama (dev) Your GPU node (Docker) You (pulled once) 8 GB (quant) None after pull
vLLM (prod) Your GPU cluster (K8s) You 48 GB (Llama 3.3 70B) / 400 GB (405B) None
Azure OpenAI Gov Microsoft Azure Gov data centers Microsoft / OpenAI None (cloud) HTTPS to Azure Gov endpoint
Azure Local + Foundry Local Your on-prem hardware (NVIDIA GPU) You (Microsoft-managed models) Large GPU required None after initial setup
AWS Bedrock GovCloud Amazon GovCloud region Amazon / Meta / Anthropic None (cloud) HTTPS to AWS GovCloud
Scale Donovan (Defense Llama) Scale AI gov environment Scale AI (US cleared) None Classified network access

Air-gap note: For a fully disconnected classified network with no cloud connectivity at all, the only currently available options are vLLM on-prem with Llama 3.3 70B or Azure Local + Foundry Local (Feb 2026, requires Microsoft partnership). All cloud-based options require an HTTPS route to the respective government cloud endpoint.


Switching Providers

The provider is controlled by a single LLM_PROVIDER env var plus optional cloud credentials. Set these in .env for dev or a Kubernetes Secret for production — never in source code.

# ─── Option A: On-Prem Dev (Ollama, unclassified GPU lab) ────────────────────
LLM_PROVIDER=ollama
LLM_ENDPOINT=http://10.0.0.5:11434/v1    # ← internal host; public IPs are blocked
LLM_MODEL=llama3.3:70b                    # ← Meta (US-origin); RMF-processable
# PROVIDER_LOCK=false                    # default; set true when IT freezes config

# ─── Option B: On-Prem Production (vLLM, GPU cluster) ───────────────────────
LLM_PROVIDER=vllm
LLM_ENDPOINT=http://llm-inference:8080/v1
LLM_MODEL=llama3.3:70b
# PROVIDER_LOCK=true                     # ← set by IT to freeze this configuration

# ─── Option C: Azure OpenAI for U.S. Government ──────────────────────────────
#   Authorized: FedRAMP High · IL4 · IL5 · IL6 Secret · ICD-503 Top Secret
LLM_PROVIDER=azure
AZURE_OPENAI_ENDPOINT=https://<your-resource>.openai.azure.com
AZURE_OPENAI_KEY=<from-azure-portal-keys-and-endpoints>
AZURE_OPENAI_DEPLOYMENT=gpt-4o            # deployment name, not model name
AZURE_API_VERSION=2024-08-01-preview
PROVIDER_LOCK=true                         # ← IT sets this to prevent any override

When LLM_PROVIDER=azure, the client automatically switches to the Azure deployment-scoped URL (/openai/deployments/{deployment}/chat/completions?api-version=...) and uses the api-key header instead of Bearer. Everything else — session memory, guardrails, RAG context, RBAC — is identical across all three options.


Provider Lock — IT Security Runbook

The PROVIDER_LOCK=true flag freezes the LLM provider configuration so that no developer, user, or misconfigured env var can redirect traffic to a public internet endpoint. It is intended to be set by IT in the Kubernetes Secret before the first production deployment. Developers do not have permission to edit the Secret.

What PROVIDER_LOCK=true enforces at startup:

Check Action on failure
LLM_PROVIDER not in {ollama, vllm, azure} Process exits — pod restarts indefinitely
Model name matches a blocked pattern (Qwen, DeepSeek, etc.) Process exits
On-prem endpoint uses localhost or 127.0.0.1 Process exits — requires a real network address
On-prem endpoint resolves to a public cloud domain Process exits
Azure endpoint doesn't match `*.openai.azure.(com</sub> us)`
Any Azure credential field is empty Process exits

What is enforced on every LLM call (regardless of lock):

  • assert_egress_url_safe() checks the constructed URL against the public domain blocklist before opening the connection — no HTTP call is made if the URL fails validation.
  • Azure provider: URL must match https://[a-z0-9-]+\.openai\.azure\.(com\|us)/…
  • On-prem providers: any *.azure.com, *.openai.com, *.anthropic.com, *.googleapis.com, or *.amazonaws.com hostname is blocked.

IT Runbook — Kubernetes setup (example: on-prem vLLM, locked):

# 1. Create the Secret with all required fields — including PROVIDER_LOCK=true.
kubectl create secret generic orchestrator-secrets \
  --from-literal=LLM_PROVIDER=vllm \
  --from-literal=LLM_ENDPOINT=http://10.0.0.20:8080/v1 \
  --from-literal=LLM_MODEL=llama3.3:70b \
  --from-literal=PROVIDER_LOCK=true \
  -n llm-assistant

# 2. Remove Secret edit access from developer service accounts.
kubectl delete rolebinding developer-secret-edit -n llm-assistant 2>/dev/null || true

# 3. Deploy. The orchestrator will validate the config at startup and exit
#    if anything is misconfigured — check logs if the pod crashloops.
kubectl apply -f infra/k8s/orchestrator-deployment.yaml
kubectl rollout status deployment/llm-orchestrator -n llm-assistant

IT Runbook — Azure OpenAI Government, locked:

kubectl create secret generic orchestrator-secrets \
  --from-literal=LLM_PROVIDER=azure \
  --from-literal=AZURE_OPENAI_ENDPOINT=https://<resource>.openai.azure.com \
  --from-literal=AZURE_OPENAI_KEY=<key-from-portal> \
  --from-literal=AZURE_OPENAI_DEPLOYMENT=gpt-4o \
  --from-literal=AZURE_API_VERSION=2024-08-01-preview \
  --from-literal=PROVIDER_LOCK=true \
  -n llm-assistant

Why this is sufficient to prevent data leaving the facility: Even if someone injected LLM_ENDPOINT=https://api.openai.com into the environment after deployment, the assert_egress_url_safe() call in llm_client.py would raise ProviderConfigError before the HTTP connection is opened. The network-level NetworkPolicy in infra/k8s/network-policy.yaml provides a second layer that drops the packets at the CNI level. Both layers must be defeated independently to exfiltrate data.

Note

AWS Bedrock GovCloud (LLM_PROVIDER=bedrock) uses AWS Signature V4 authentication via boto3. The BedrockGovProvider validates GovCloud region (us-gov-west-1 / us-gov-east-1), credentials, and blocked model list at startup. Implemented in v3.0.


Classification Decision Matrix

Use this to pick the right provider for your deployment environment.

Enclave / Network Required Provider Setting Authorization Basis Cloud?
Unclassified dev (air-gapped GPU lab) LLM_PROVIDER=vllm + LLM_MODEL=llama3.3:70b On-prem; no external auth needed
Unclassified dev (laptop, no GPU) LLM_PROVIDER=ollama + LLM_MODEL=llama3.3:70b On-prem dev only
IL2 / Low (NIPRNet-adjacent) LLM_PROVIDER=azure + GPT-4o DISA FedRAMP High (Aug 2024)
IL4 / Moderate (CUI, SIPRNet-adjacent) LLM_PROVIDER=azure + GPT-4o DISA IL4 PA (Sep 2024)
IL5 / High (National Security Systems) LLM_PROVIDER=azure + GPT-4o DISA IL5 PA (Sep 2024)
IL6 / Secret LLM_PROVIDER=azure + GPT-4o DISA IL6 auth (Feb 2025) ✅ Gov
ICD-503 / Top Secret Azure OpenAI in Azure Gov Top Secret cloud ICD-503 auth (Jan 2025) ✅ Gov TS
Fully disconnected / sovereign (no cloud) Azure Local + Foundry Local on-prem Microsoft Sovereign Cloud (Feb 2026)
IL4/IL5 — AWS preference LLM_PROVIDER=bedrock (roadmap) + Llama 3 70B or Claude 3.5 FedRAMP High + IL4/IL5 AWS GovCloud (May 2025) ✅ AWS Gov

(back to top ↑)


�🛠️ Technology Stack

See 🔌 Pluggable LLM Backend for the full LLM security guide, cloud vs on-prem comparison, and provider switching instructions.

Detailed Stack Rationale

🐍 Python 3.12+

What it does: Runtime for the entire orchestrator service — API layer, guardrails, session store, LLM client, static analysis, and RAG retriever all run in the same Python process under asyncio.

Why chosen: Python has the deepest ecosystem for LLM tooling, AST analysis, and async I/O. The ast module gives zero-dependency Python static analysis. match/case (3.10+) maps cleanly to Java switch/case in translation rules. asyncio.Lock enables safe per-session concurrency without threading complexity. The async story with FastAPI + httpx + asyncpg is cohesive and well-maintained.

How it helps: Every I/O-bound operation (LLM call, pgvector query, embedding call) runs concurrently without blocking — a single server process can pipeline multiple translation requests in parallel. Python 3.14 is verified working on this project's Arch Linux dev machine; Docker targets python:3.12-slim for stable production images.

Alternatives considered: Go — excellent performance but minimal LLM/AST ecosystem; Java — ironic choice for a Java→Python translator; Rust — no mature async HTTP server with Pydantic-equivalent validation.


⚡ FastAPI 0.115+

What it does: REST API framework serving all 10 endpoints. Handles request parsing (Pydantic v2), dependency injection (Depends), lifespan context management (background session purge task), and OpenAPI schema generation.

Why chosen: FastAPI is the only Python framework that natively combines: async ASGI handlers, Pydantic v2 request/response validation, Depends() for clean per-request auth injection, and an asynccontextmanager lifespan hook for background tasks — all without boilerplate. It generates interactive OpenAPI docs (disabled in prod, enabled in dev via ENV=dev) that the VS Code extension and web app use during integration testing.

How it helps: Every CodeRequest, TranslationRequest, and ChatRequest is validated by Pydantic before reaching any business logic — invalid inputs get a 422 Unprocessable Entity before they even touch the guardrail layer. The Depends(require_permission(...)) pattern enforces RBAC at the framework level; a missing permission header is a 401 before any handler code runs.

Alternatives considered: Flask — synchronous by default, no native dependency injection, no validation layer; Django REST Framework — heavyweight ORM/admin apparatus we don't need; Starlette — FastAPI's foundation, but FastAPI adds the Pydantic layer we need at no cost.


🔑 PyJWT 2.8+ with RS256

What it does: Validates incoming Authorization: Bearer <token> headers. Fetches the JWKS from the internal IdP on first request (cached), verifies the RS256 signature, decodes the claims, and extracts sub (user identifier) and roles.

Why chosen: RS256 (asymmetric JWT) means the orchestrator only holds the public key — it can verify tokens but never mint them. If the orchestrator is compromised, attackers cannot forge tokens. PyJWT is RFC 7519 compliant, has zero transitive dependencies beyond cryptography, and is the de facto standard in the Python ecosystem.

How it helps: The sub claim is the user's unique identifier used as the session store scope key ({sub}:{session_id}), ensuring session isolation across users. The roles claim drives RBAC — engineer, contractor, admin roles map to different permission sets without any database lookup on the critical request path.

Alternatives considered: python-jose — adds JWKS fetching but pulls in additional dependencies; authlib — heavier OAuth/OIDC library; rolling our own — never.


📦 Pydantic-settings 2.3+

What it does: Loads all configuration from environment variables at boot time into a typed Settings dataclass. Exposes LLM_ENDPOINT, LLM_MODEL, VECTOR_DB_URL, EMBEDDING_ENDPOINT, RAG_ENABLED, ENABLE_GUARDRAILS, and all other config as typed Python attributes.

Why chosen: Fail-fast validation at startup — if LLM_ENDPOINT is not set, the service refuses to start with a clear error rather than silently failing on the first LLM call. Type coercion handles RAG_ENABLED=truebool, RAG_TOP_K=5int automatically. The single settings singleton is importable everywhere without passing config through call chains.

How it helps: Eliminates the entire class of "wrong type in env var" bugs. Secrets stay in .env / K8s Secrets and never touch source code. model_config = SettingsConfigDict(env_file=".env") makes local dev identical to production — same code path, different env vars.

Alternatives considered: dynaconf — powerful but complex layering model; python-decouple — simpler but no type validation; raw os.environ — no validation, no types.


🌐 httpx 0.27+

What it does: Makes all outbound HTTP calls: LLM /chat/completions requests to the Ollama/vLLM server, and embedding requests to the nomic-embed-text embedding server.

Why chosen: httpx is the only async Python HTTP client with a clean AsyncClient context manager, per-request timeout configuration, and automatic connection pooling — all without the callback-based API of aiohttp. The timeout=httpx.Timeout(connect=5.0, read=120.0) pattern sets a short connect timeout (fast failure on unreachable LLM) and a long read timeout (72B model can take ~90s to generate a large translation).

How it helps: Used as an async context manager inside call_llm() — no shared global session means each request gets a fresh connection from the pool, avoiding state leakage between concurrent requests. The consistent interface makes it easy to swap between Ollama (dev) and vLLM (prod) endpoints without changing client code.

Alternatives considered: aiohttp — callback-style API with more boilerplate; requests — synchronous, blocks the event loop; urllib3 — lower-level, no async.


☕ javalang 0.13+

What it does: Pure-Python Java lexer and parser. Given a Java source string, it returns an AST from which java_analyzer.py extracts class name, package, methods, fields, extends, implements, interface/abstract flags, and imports — all without a JVM.

Why chosen: The alternative to pure-Python parsing is running a JVM sidecar (Eclipse JDT, javac, or ANTLR4 Java grammar with the Java runtime). On an air-gapped GPU node, installing and maintaining a JVM just for parsing is a significant operational burden. javalang parses the full Java 8 grammar in microseconds and is a single pip install.

How it helps: The metadata extracted — class name, method signatures, package — is injected into every translation prompt as structured context. This tells the LLM exactly what it's translating before it reads the raw code, dramatically reducing hallucinated class names, missing imports, and incorrect method signatures in the Python output. The RAG query is also built from this metadata (build_rag_query(class_name, methods, package)) for more precise vector search results.

Alternatives considered: Eclipse JDT via subprocess — requires JVM, complex setup; ANTLR4 Java grammar — requires Java runtime for the grammar runtime; tree-sitter-java (Python bindings) — viable alternative but requires a native compiled extension.


🗄️ PostgreSQL 16 + pgvector

What it does: Stores the RAG index — chunked source code and documentation snippets with 768-dimensional nomic-embed-text vector embeddings. The retriever.py module embeds the query, runs an ivfflat approximate nearest-neighbour search via asyncpg, and returns the top-k most semantically similar chunks as context for translation prompts.

Why chosen: pgvector is a Postgres extension — not a standalone service. The infrastructure team already operates and backs up the Postgres cluster. Adding vector search requires CREATE EXTENSION vector and one DDL migration — no new service, no new runbook, no new vendor. The ivfflat index with lists=100 gives sub-10ms query times on a 500k-document index.

How it helps: When translating OrderService.java, the retriever finds internal Python code that already uses Order data models, naming conventions, and import patterns from the organisation's codebase. Injecting this context into the translation prompt anchors the LLM to internal conventions — the output uses from models.order import Order instead of inventing its own import path. This is the single biggest quality multiplier for multi-file project translation.

Alternatives considered: Qdrant — excellent vector DB but a new service the infra team would need to learn; Weaviate — similar objection; Redis Stack — vector search available but weaker filtering; FAISS in-process — no persistence, no classification filtering.


🧬 nomic-embed-text (768-dim embeddings)

What it does: Self-hosted embedding model served at http://embedding-server:8080/v1/embeddings. Converts source code snippets and query strings into 768-dimensional dense vectors for semantic similarity search.

Why chosen: nomic-embed-text is purpose-built for long-document embedding (8192-token context), is available under Apache 2.0, runs on a single GPU, and serves an OpenAI-compatible /v1/embeddings endpoint. At 768 dimensions it balances recall quality against pgvector index size — 1M documents at 768 dims uses ~3 GB of vector storage.

How it helps: Code snippets are semantically dense — List<Order> pending and list[Order] are syntactically different but semantically identical. A code-aware embedding model captures this equivalence; a generic text model might not. Using the same model for both indexing and retrieval guarantees the vector space is consistent.

Alternatives considered: OpenAI text-embedding-3-small — cloud call, violates air-gap; all-MiniLM-L6-v2 — 384 dims, lower recall on code; text-embedding-ada-002 — 1536 dims, larger index, cloud-only.


🤖 Ollama (dev) / vLLM (prod)

What it does: Serves the LLM on internal GPU nodes behind an OpenAI-compatible /v1/chat/completions API. Ollama is used in development and CI; vLLM is used in production for higher throughput and tensor parallelism across multiple GPUs.

Why chosen — Ollama (dev): One-command model management (ollama pull llama3.3:70b), automatic quantisation, minimal setup. Single GPU, single engineer, zero config. Ideal for local development where request throughput is not a concern. PRC-origin models (Qwen, DeepSeek, etc.) are blocked at the application layer via core/provider_lock.py and will cause startup failure if configured.

Why chosen — vLLM (prod): PagedAttention enables continuous batching — multiple concurrent translation requests are batched into a single GPU forward pass, doubling effective throughput vs naive serving. Tensor parallelism splits the 72B model across multiple GPUs. The OpenAI-compatible API means zero code changes between dev (Ollama) and prod (vLLM) — only LLM_ENDPOINT changes.

How it helps: Temperature is always set to 0.0 — deterministic LLM output is critical for a translation tool. The same Java input must produce the same Python output on re-run; temperature > 0 introduces non-determinism that breaks regression tests and makes output review harder.

Alternatives considered: HuggingFace TGI — less stable batching, complex config; NVIDIA Triton — enterprise-grade but heavy; llama.cpp server — see note below.

llama.cpp server in a secure environment — when it IS the right choice: llama.cpp is a single C++ binary with no cloud dependencies, no telemetry, and no runtime package manager. For the right deployment it is not just acceptable — it is preferable:

Scenario llama.cpp Ollama vLLM
No GPU available (CPU-only, classified SCIF) ✅ Best option ⚠️ Slow ❌ Requires CUDA
Edge / disconnected device (Jetson, NUC) ⚠️
Single low-VRAM GPU (8–16 GB, dev) ✅ 4-bit GGUF ⚠️ Limited
Multi-GPU production cluster (high concurrency) ❌ Single-threaded batching ✅ Best option
Minimal attack surface / no Docker daemon ✅ Single binary ❌ Requires Docker

Security posture: llama.cpp is C++ — memory safety vulnerabilities are possible in principle, but the codebase is widely reviewed and has no network calls outside the inference API port you bind. It holds no API keys, makes no outbound connections, and carries no supply-chain runtime dependencies. This is a smaller attack surface than Ollama (Go binary + Docker) or vLLM (Python + CUDA + dozens of PyPI packages).

Model security is unchanged: The inference server is irrelevant to model supply-chain security. A GGUF-quantized llama3.3:70b served by llama.cpp is the same weights as the same model served by vLLM — core/provider_lock.py blocks PRC-origin models regardless of which server is used.

Recommendation: Use llama.cpp when you need a CPU-only or edge deployment, or when minimizing attack surface on a resource-constrained classified system. Use vLLM when you need multi-GPU throughput with concurrent users. The LLM_ENDPOINT env var is all that changes — no code differences.


☸️ Kubernetes 1.28+

What it does: Production container orchestration. The NetworkPolicy in infra/k8s/network-policy.yaml implements default-deny egress with an allowlist of exactly four internal peers: API gateway (8000), LLM inference (8080), pgvector (5432), IdP (443). No internet egress rule exists.

Why chosen: Kubernetes NetworkPolicy is the correct tool for enforcing network-level isolation in a classified environment — it operates at the CNI layer and cannot be bypassed by application code. Rolling deploys ensure zero-downtime updates. Namespace isolation provides a blast-radius boundary: a compromised orchestrator pod cannot reach other namespaces.

How it helps: The NetworkPolicy is the last line of defence for the air-gap guarantee. Even if a vulnerability allowed arbitrary outbound calls from application code, the CNI-level policy would drop the packets. This cannot be achieved with application-level controls alone.

Alternatives considered: Nomad — less mature RBAC and NetworkPolicy equivalents; bare Docker — no network isolation primitives; ECS — AWS dependency, incompatible with on-prem GPU nodes.


🔧 asyncpg 0.29+

What it does: Async PostgreSQL driver. Used exclusively in rag/retriever.py to execute pgvector nearest-neighbour queries without blocking the FastAPI event loop.

Why chosen: asyncpg is the fastest Python PostgreSQL driver — it speaks the binary wire protocol directly without converting to/from Python strings. A vector similarity search on a 500k-row table completes in ~8ms with asyncpg vs ~40ms with psycopg2 (synchronous). The async interface means concurrent translation requests each run their RAG query in parallel during the same event loop iteration.

Alternatives considered: psycopg3 async — viable but slower than asyncpg; SQLAlchemy async — adds ORM overhead we don't need for two queries; psycopg2 — synchronous, blocks the event loop.


🔒 asyncio.Lock (in-process session store)

What it does: The session store (core/session_store.py) uses asyncio.Lock at two granularities: a global _store_lock for the session map and a per-session lock on each _Session dataclass. This prevents concurrent requests on the same session from corrupting conversation history.

Why chosen: The session store is in-process (no Redis, no external service). For a single-replica deployment this is correct and fast — no network round-trips for history access. asyncio.Lock is zero-overhead compared to a mutex (no kernel calls) because Python's asyncio is cooperative — only one coroutine runs at a time. The lock only matters for async def coroutines that could be interleaved at await points.

How it helps: Without per-session locking, two concurrent follow-up requests on the same session (e.g., a VS Code extension and a web app tab both open) could both read the same history, race to append their exchange, and corrupt the turn sequence. The lock serialises this safely.

Note on scaling: Set SESSION_BACKEND=redis and REDIS_URL=redis://your-cluster:6379/0 to switch to the Redis/Valkey-backed session store (core/redis_session_store.py) for multi-replica deployments. The memory backend remains the default for single-replica development environments.

Technology Version Role Why Chosen
Python 3.12+ Orchestrator runtime Async ecosystem, ast module, match/case
FastAPI 0.115+ REST framework Pydantic v2, Depends() injection, lifespan hooks
asyncio stdlib Concurrency Per-session locking, background purge task
PyJWT 2.8+ RS256 JWT validation Asymmetric; orchestrator holds public key only
pydantic-settings 2.3+ Typed config Fail-fast at boot; zero runtime type surprises
httpx 0.27+ LLM + embedding HTTP client Async; per-request timeout; no global session
asyncpg 0.29+ pgvector queries Binary protocol; non-blocking; fastest Python PG driver
javalang 0.13+ Java parsing Pure Python; no JVM; class/method/field extraction
PostgreSQL 16 + pgvector pg16 Vector RAG index Reuses existing Postgres infra; ivfflat ANN search
nomic-embed-text 768-dim code embeddings Long-context (8192 tok); Apache 2.0; self-hosted
Ollama latest Dev LLM inference One-command model pull; OpenAI-compatible API
vLLM latest Prod LLM inference PagedAttention batching; tensor parallelism; same API
Kubernetes 1.28+ Production orchestration NetworkPolicy CNI-level air-gap enforcement
Docker Compose v2 Dev stack One-command full-stack; no K8s locally
Ansible 2.16+ GPU node hardening Idempotent; disables internet egress on bare metal

(back to top ↑)


🚀 Setup & Installation

Prerequisites

  • Docker Engine 24+ and Docker Compose v2, or Python 3.12+ with access to an internal LLM inference server
  • Internal network access to the OIDC/LDAP identity provider

Option A — Docker Compose (recommended for dev)

git clone https://github.com/hkevin01/secure-llm-assistant.git
cd secure-llm-assistant

docker compose -f docker/docker-compose.yml up -d ollama

# Pull Llama 3.3 70B — default model (Meta, US-origin, ~48 GB VRAM)
docker exec ollama ollama pull llama3.3:70b

# No GPU / low-VRAM machine? Use Llama 3.1 8B (~8 GB VRAM):
# docker exec ollama ollama pull llama3.1:8b
# Then set LLM_MODEL=llama3.1:8b in docker/docker-compose.yml

docker compose -f docker/docker-compose.yml up -d

curl http://localhost:8000/api/v1/health

Press Ctrl+C to stop the log stream; the containers keep running. Use docker compose -f docker/docker-compose.yml down to stop all services.

Note

The first pull of llama3.3:70b downloads ~48 GB. On a 1 Gbps internal link this takes ~7 minutes. Subsequent starts use the cached ollama-models volume and are fast. PRC-origin models (Qwen, DeepSeek, etc.) are blocked by core/provider_lock.py at startup — the service will refuse to start if a blocked model is configured.

Option B — Local Python (no Docker)

cd services/orchestrator-python
cp .env.example .env
pip install -r requirements.txt
uvicorn src.main:app --reload --port 8000

Option C — Kubernetes (production)

kubectl create namespace llm-assistant
kubectl create secret generic orchestrator-secrets \
  --from-env-file=services/orchestrator-python/.env \
  -n llm-assistant

kubectl apply -f infra/k8s/network-policy.yaml
kubectl apply -f infra/k8s/orchestrator-deployment.yaml

kubectl rollout status deployment/llm-orchestrator -n llm-assistant
kubectl exec -n llm-assistant deploy/llm-orchestrator -- \
  curl -s localhost:8000/api/v1/health

Run Tests

cd services/orchestrator-python
pytest tests/ -v
⚙️ Environment Variable Reference
Variable Default Required Description
ENV prod No Set dev to enable Swagger UI at /docs
LLM_ENDPOINT http://llm-inference:8080/v1 Yes Internal LLM server base URL (Ollama or vLLM)
LLM_MODEL llama3.3:70b No Model tag — PRC-origin models (Qwen, DeepSeek, etc.) are blocked at startup
PROVIDER_LOCK false No Set true to freeze provider config; blocks localhost, public endpoints, and blocked models at startup
LLM_TEMPERATURE 0.0 No Must remain 0 — ensures deterministic translation output
LLM_MAX_TOKENS 4096 No Max tokens per LLM response
EMBEDDING_ENDPOINT http://embedding-server:8080/v1 No nomic-embed-text server URL (RAG only)
EMBEDDING_MODEL nomic-embed-text No Embedding model name passed to the embeddings API
RAG_ENABLED true No Set false to disable pgvector retrieval entirely
RAG_TOP_K 5 No Number of RAG chunks injected per translation prompt
VECTOR_DB_URL If RAG_ENABLED asyncpg-format pgvector connection string
ALLOWED_ORIGINS https://llm.internal Yes CORS allowlist (comma-separated)
AUDIT_LOG_PATH /var/log/llm-assistant/audit.jsonl No Audit log file path (must be append-writable)
MAX_INPUT_TOKENS 8192 No Input size ceiling (characters); larger inputs are rejected
ENABLE_GUARDRAILS true No Never set false in production

(back to top ↑)


📖 API Reference

/translate — Java → Python

Request

{
  "code": "public class OrderService { ... }",
  "style": "idiomatic"
}

Response

{
  "python": "@dataclass\nclass OrderService:\n    _pending: list[Order] = field(default_factory=list)\n    ...",
  "java_metadata": {
    "name": "OrderService",
    "package": "com.example",
    "methods": ["enqueue", "process"],
    "fields": ["pending"],
    "extends": null,
    "complexity": "Approx metrics — methods: 2, loops: 1, branches: 3"
  }
}

/translate-requirements — Requirements → Python Scaffold

Request

{
  "text": "REQ-001: The system shall validate user email addresses before account creation."
}

Response

{
  "python": "def validate_email_address(email: str) -> bool:\n    \"\"\"...\"\"\"\n    raise NotImplementedError(\"REQ-001: ...\")\n\n# ── TESTS ──\ndef test_validate_email_address_valid(): ..."
}

Warning

All translation output is a scaffold requiring human review before production deployment. The LLM preserves business logic but cannot guarantee correctness of complex algorithmic code. Always run the generated tests and conduct a security review of the output.

(back to top ↑)


📈 Development Status

Version Stability Tests Python What's In It
v1.5 ✅ Stable 27 passing 3.12+ Orchestrator, JWT/RBAC, guardrails, Java→Python translation, requirements scaffold, Python/Java static analysis
v2.0 ✅ Stable 55 passing 3.12+ RAG retriever (pgvector + nomic-embed-text), multi-file project translation (dependency graph + topo sort), multi-turn session memory (/chat, session TTL, sliding window), VS Code extension (alpha)
v2.5 ✅ Complete 169 passing 3.12+ test_provider_lock.py (105 tests), Provider Lock (PROVIDER_LOCK=true), Qwen/DeepSeek model blocklist, egress URL safety enforcement, Azure OpenAI Government support, VS Code extension beta
v3.0 ✅ Complete 265 passing 3.12+ Incremental diff translation (/translate-diff), JetBrains plugin scaffold, multi-replica session store (Redis/Valkey), AWS Bedrock GovCloud adapter (LLM_PROVIDER=bedrock), model ATO evaluation framework (/evaluate-model-ato)

What shipped in v2.0 (this release):

  • src/rag/retriever.py — async pgvector nearest-neighbour retrieval; degrades gracefully on error
  • src/tools/project_translator.py — Kahn topological sort; dependency graph from extends/implements
  • src/tools/translation_tools.pySTYLE_DIRECTIVES constant; rag_context and extra_instructions params on all prompt builders
  • src/core/session_store.py — in-process session store; TTL=3600s; MAX_HISTORY_TURNS=20; MAX_INJECTED_CHARS=3000
  • src/core/llm_client.pyhistory: list[dict] | None parameter; history injected between system prompt and user message
  • src/main.py — lifespan context manager; background session purge task every 600s
  • src/api/routes.py — all 8 data endpoints wired with get_history / append_exchange; new /chat endpoint; new DELETE /session/{id}; session_id in all request/response models
  • ✅ All endpoints return session_id in response for client tracking

Verified working on: Python 3.12, 3.13, 3.14 (Arch Linux). Docker images target python:3.12-slim.

(back to top ↑)


🗺️ Roadmap

timeline
    title Release History & Planned Milestones
    2025-Q4 : v1.0 — Orchestrator, guardrails, JWT/RBAC
    2026-Q1 : v1.2 — Java Structural Analyzer, Algorithm Tools
    2026-Q2 : v1.5 — Java→Python Translation, Requirements Scaffold
    2026-Q2 : v2.0 — RAG, Multi-file Translation, Session Memory (current)
    2026-Q3 : v2.5 — Provider Lock, PRC model blocklist, test_provider_lock.py, VS Code Extension beta
    2026-Q4 : v3.0 — Incremental Diff, JetBrains Plugin, Multi-replica Sessions, Bedrock Adapter
Loading
gantt
    title Secure LLM Assistant — Active Roadmap
    dateFormat  YYYY-MM
    section Core Platform
        Orchestrator + Guardrails       :done,    p1, 2025-10, 2025-12
        JWT/RBAC Auth                   :done,    p2, 2025-11, 2025-12
        Java→Python Translation         :done,    p3, 2026-01, 2026-04
        Requirements→Python Scaffold    :done,    p4, 2026-03, 2026-04
        RAG Retriever (pgvector)        :done,    p5, 2026-03, 2026-04
        Multi-file Project Translation  :done,    p6, 2026-04, 2026-04
        Multi-turn Session Memory       :done,    p7, 2026-04, 2026-04
    section Quality & Tooling
        LLM Provider Switching + Provider Lock :done,    q0, 2026-04, 2026-04
        test_provider_lock.py (105 tests)       :done,    q1b,2026-04, 2026-04
        test_session_store.py                   :done,    q2, 2026-04, 2026-04
        Golden-file Translation Tests       :active,  q1, 2026-04, 2026-07
        VS Code Extension (beta)            :active,  q3, 2026-04, 2026-08
    section Advanced Capabilities
        Incremental Diff Translation    :done,    a1, 2026-07, 2026-11
        Maven/Gradle → pyproject.toml   :         a2, 2026-08, 2026-11
        JetBrains Plugin                :done,    a3, 2026-09, 2027-01
        Multi-replica Session Store     :done,    a4, 2026-10, 2027-01
        AWS Bedrock GovCloud Adapter    :done,    a5, 2026-07, 2026-11
        Model ATO Evaluation Framework  :done,    a6, 2026-06, 2026-10
Loading
Phase Goals Target Status
v1.5 Java→Python + Requirements scaffold, 27-test suite Q2 2026 ✅ Done
v2.0 RAG-enriched translation, multi-file project translation, multi-turn session memory, 55-test suite Q2 2026 ✅ Done
v2.5 test_provider_lock.py (105 tests, 169 total), PROVIDER_LOCK IT config freeze, PRC-origin model blocklist (Qwen/DeepSeek/etc.), egress URL safety, Azure OpenAI Gov support, VS Code extension beta Q3 2026 ✅ Complete
v3.0 Incremental diff translation (/translate-diff), JetBrains plugin, multi-replica session store (Redis/Valkey), AWS Bedrock GovCloud adapter (LLM_PROVIDER=bedrock), model ATO evaluation (/evaluate-model-ato), 265-test suite Q4 2026 ✅ Complete

(back to top ↑)


🤝 Contributing

See .github/CONTRIBUTING.md for the full guide, including the Translation Quality Standards section.

📋 Quick Contribution Checklist
  1. Branch from main: git checkout -b feat/your-feature
  2. Apply NASA-style block comments (ID, Requirement, Purpose, etc.) on every new function
  3. Route all user input through sanitize() and all LLM output through validate_output()
  4. Add require_permission("your_permission") to every new endpoint
  5. Call audit.log() on both the allowed and blocked code paths
  6. Write tests: happy path + at least one edge case + at least one security negative
  7. Run the pre-commit suite:
    ruff check services/orchestrator-python/src/
    bandit -r services/orchestrator-python/src/ -ll
    pytest services/orchestrator-python/tests/ -v
  8. Open a PR using .github/PULL_REQUEST_TEMPLATE.md
  9. Two approvals required: @platform-team + @security-team for any security-touching path

Commit prefix convention: translate:, security:, fix:, feat:, docs:, test:, infra:

(back to top ↑)


📄 License & Acknowledgements

License: Internal use only. All rights reserved. Not licensed for external distribution. See .github/SECURITY.md for the vulnerability reporting policy and SLA.

Acknowledgements:

  • javalang — pure-Python Java parser that makes JVM-free class analysis possible
  • pgvector — Postgres vector extension powering the RAG index
  • nomic-embed-text — Apache 2.0 code-aware embedding model
  • FastAPI — async Python API framework with Pydantic v2 and Depends() injection
  • asyncpg — fastest async Python PostgreSQL driver
  • Ollama — self-hosted LLM inference with OpenAI-compatible API
  • vLLM — PagedAttention batching for production LLM serving
  • OWASP API Security Top 10 — threat model baseline for all endpoint design decisions
  • NIST SP 800-37 RMF / DoD RMF — framework for ATO considerations in the classified deployment guidance

(back to top ↑)

About

Air-gapped, on-prem LLM assistant for software engineering teams. No external network calls. Full audit trail. RBAC + OIDC/LDAP auth.

Topics

Resources

Contributing

Security policy

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors