
Add curated MCP / external-API / library registry with security hardening #252

@kai-linux

Description

Goal

Add a curated registry through which agents can be granted access to (a) MCP servers (filesystem, GitHub, Linear, Slack, Playwright, domain-specific vendor MCPs), (b) external HTTP APIs (OCR eval, cost APIs, observability APIs, cloud providers), and (c) new libraries / frameworks (e.g. pydantic-ai, instructor, dspy, a new vector store) that a repo could benefit from but currently has no pathway to discover or adopt.

Today agents get only whatever tools their underlying CLI ships with. Every integration is bespoke Python glue. There is no per-repo, operator-curated way to say "this repo may use the Linear MCP and the pydantic-ai framework" — nor any way for the system to notice "a new framework exists that would simplify what this repo is doing by hand."

Success Criteria

Phase 1 — Tool registry (MCPs + HTTP APIs)

  • New config.yaml: tool_registry section with two subsections:
    • mcp_servers: [{name, command, args, env_refs, description, default_permissions}]
    • http_apis: [{name, base_url, auth_kind (bearer|header|none), env_refs, description}]
  • Per-repo opt-in under repos.<key>.enabled_tools: [linear_mcp, playwright_mcp, receipt_eval_api]
  • Per-task-type permission scoping: tool_registry.permissions: { groomer: [linear_mcp:read], quality_harness: [receipt_eval_api:call] }
  • orchestrator/tool_registry.py exposes resolve_tools_for(repo_key, task_type) -> ToolBundle returning normalized flags for the agent adapter (--mcp-config, tool allowlist, etc.)
  • Adapters (claude, codex, gemini, deepseek) wire the bundle into their invocation; unsupported adapters log a clear "tool X not available on adapter Y" warning rather than silently dropping it
  • Credentials only via env-var references — registry never stores secrets inline; startup validation fails closed if a referenced env var is missing
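A hypothetical `config.yaml` fragment illustrating the shapes above. Key names follow this issue's spec; the concrete entries (`github_mcp`, `GITHUB_TOKEN`, the example base URL) are placeholders, not decided values:

```yaml
tool_registry:
  mcp_servers:
    - name: github_mcp
      command: npx
      args: ["-y", "@modelcontextprotocol/server-github@1.2.3"]  # pinned version, never @latest
      env_refs: [GITHUB_TOKEN]          # reference only; the secret lives in the environment
      description: GitHub issues/PRs via MCP
      default_permissions: [read]
  http_apis:
    - name: receipt_eval_api
      base_url: https://eval.example.com  # also the outbound-HTTP allowlist origin
      auth_kind: bearer
      env_refs: [RECEIPT_EVAL_TOKEN]
      description: Vendor OCR ground-truth scoring
  permissions:
    groomer: [linear_mcp:read]
    quality_harness: [receipt_eval_api:call]

repos:
  my_repo:
    enabled_tools: [github_mcp, receipt_eval_api]
```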

Phase 2 — Library / framework awareness

  • New orchestrator/library_scout.py runs on a slow cadence (default: monthly)
  • For each repo, reads pyproject.toml / package.json and the task history to understand what the repo does (modality labels from #250, "Add quality-harness architect with modality detection and field-failure fixture loop", plus issue titles)
  • Consults a curated library_catalog.yaml (operator-owned, at repo root) listing candidate libraries with {name, category, fits_when, url, last_verified_at}. Example categories: llm_framework, validation, evals, vector_store, browser_automation, observability
  • Emits library_suggestion findings via the same scorer → groomer pipeline with {repo, library_name, reason, fit_signals, proposed_experiment}
  • Groomer renders findings; operator approves via Telegram before a spike/experiment issue is filed (same anti-slop gate as #249, "Add system-architect agent for capability + sensor gap detection")
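A minimal sketch of the catalog-restricted matching the scout could do. The finding fields come from this issue; the `fits_when` signal matching and helper names are assumptions for illustration:

```python
# Sketch of library_scout matching logic. Assumes catalog entries are the
# parsed rows of library_catalog.yaml and history_signals are labels derived
# from task history (e.g. "manual_json_parsing"). Illustrative only.
from dataclasses import dataclass, field


@dataclass
class LibrarySuggestion:
    repo: str
    library_name: str
    reason: str
    fit_signals: list = field(default_factory=list)
    proposed_experiment: str = ""


def suggest_libraries(repo_key, deps, history_signals, catalog):
    """Emit suggestions ONLY for libraries already in the operator-owned catalog."""
    suggestions = []
    for entry in catalog:
        if entry["name"] in deps:
            continue  # already adopted; nothing to suggest
        hits = [s for s in history_signals if s in entry.get("fits_when", [])]
        if hits:
            suggestions.append(LibrarySuggestion(
                repo=repo_key,
                library_name=entry["name"],
                reason=f"repo shows {', '.join(hits)} but lacks {entry['name']}",
                fit_signals=hits,
                proposed_experiment=f"spike: trial {entry['name']} on one task",
            ))
    return suggestions
```

Because iteration is over the catalog rather than over an open package index, a library outside `library_catalog.yaml` can never surface, no matter what the deps or history show.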

Phase 3 — Security hardening (non-negotiable)

  • Registry entries require explicit operator approval — no autonomous edits to tool_registry or library_catalog.yaml by any agent, ever
  • MCP server commands must match a pinned executable + pinned version (e.g. npx -y @modelcontextprotocol/server-github@1.2.3, never @latest)
  • Pre-flight verification on startup: each MCP server's package is checked against a verified_packages.yaml manifest with {name, version, sha256, source_url}; a mismatch fails closed and fires a Telegram alert
  • Library suggestions restricted to the curated library_catalog.yaml — the scout cannot propose a package that is not already in the catalog; catalog entries are added by the operator only
  • All tool invocations and library-suggestion approvals logged to the audit trail (#247, "Replace telegram_actions log with hash-chained immutable audit trail")
  • Outbound HTTP from agents routed through a per-repo allowlist (tool_registry.http_apis[].base_url is the only permitted origin set); unknown hosts blocked at the adapter layer where feasible
  • No dynamic pip install / npm install by agents during task execution — dependency changes still flow through normal PRs reviewed by the existing gates
  • Daily digest gains a "tool registry status" line: registered servers, verification results, any failed pre-flight checks
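The pre-flight check is a plain digest comparison. A minimal sketch, assuming the caller has fetched the package artifact bytes and parsed the matching verified_packages.yaml entry (the exception name and dict keys here are illustrative):

```python
# Sketch of the fail-closed pre-flight check against verified_packages.yaml.
import hashlib


class PreflightError(Exception):
    """Raised to fail closed; the caller is responsible for the Telegram alert."""


def verify_package(artifact_bytes, manifest_entry):
    """Compare a downloaded MCP server package against its pinned manifest entry.

    manifest_entry is one parsed row of verified_packages.yaml:
    {name, version, sha256, source_url}. Returns True on match, raises on mismatch.
    """
    digest = hashlib.sha256(artifact_bytes).hexdigest()
    if digest != manifest_entry["sha256"]:
        raise PreflightError(
            f"{manifest_entry['name']}@{manifest_entry['version']}: sha256 mismatch "
            f"(expected {manifest_entry['sha256']}, got {digest})"
        )
    return True
```

Raising (rather than logging and continuing) is what makes the check fail closed: the server never starts unless the checksum matches.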

Acceptance test

  • Synthetic MCP with correct pinned version + sha256 passes pre-flight; mutated sha256 fails closed
  • Per-repo enabled_tools correctly narrows the bundle passed to the adapter
  • A repo without pydantic-ai in deps, where task history shows manual JSON-parsing work, produces a library_suggestion finding pointing at pydantic-ai (assuming it is in the catalog); after operator approval, groomer files a spike issue
  • A library not in the catalog never appears as a suggestion, even if deps/history would motivate it
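The narrowing behaviour in the second bullet can be sketched as an intersection of two filters: the repo's opt-in list and the task-type scopes. A hypothetical shape for `resolve_tools_for`, assuming the registry and repos dicts mirror the config sections described in Phase 1:

```python
# Sketch of per-repo + per-task-type narrowing for resolve_tools_for.
# Scope strings are "tool_name:verb" as in this issue's examples.
def resolve_tools_for(repo_key, task_type, registry, repos):
    """Return only tools that are both repo-enabled and scoped to this task type.

    An empty result means the adapter falls back to its default toolset,
    preserving backwards compatibility for repos with no enabled_tools.
    """
    enabled = set(repos.get(repo_key, {}).get("enabled_tools", []))
    scopes = registry.get("permissions", {}).get(task_type, [])
    scoped = {s.split(":", 1)[0] for s in scopes}
    return [t for t in registry["mcp_servers"] + registry["http_apis"]
            if t["name"] in enabled and t["name"] in scoped]
```

Requiring membership in both sets is what prevents a groomer task from accidentally receiving quality-harness scopes, per the Constraints section.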

Constraints

  • Operator-curated registries only — no autonomous additions to tool_registry or library_catalog.yaml. This is the anti-malware / anti-supply-chain bet, mirroring the target_operating_model.yaml pattern from #249, "Add system-architect agent for capability + sensor gap detection"
  • Pinned versions + checksums required for MCP servers. @latest is banned. The system refuses to start a server whose checksum does not match the manifest
  • No secret storage in registry — env-var references only; startup validation surfaces missing creds loudly
  • Per-task-type permission scoping is required, not optional — an agent running a groomer task must not get quality-harness scopes by accident
  • Library scout is suggestion-only — never opens a PR that adds a dependency; only files a spike/experiment issue for operator approval
  • Backwards-compatible — repos with no enabled_tools continue to work exactly as today, with the adapter's default toolset

Task Type

architecture

Why

Agent-os today can only do what its LLM CLIs can do out of the box. The moment a repo needs Linear ticketing, Slack messaging, a browser, a vendor OCR eval, or a newer framework like pydantic-ai / instructor / dspy that would let it stop hand-rolling structured extraction — agent-os has no path forward. Every such integration becomes a one-off Python patch.

This is a ceiling on how far agent-os can grow. Two specific ceilings:

  1. Tool ceiling — the quality harness (#250, "Add quality-harness architect with modality detection and field-failure fixture loop") hits this immediately: multimodal eval often needs vendor APIs (receipt-OCR ground truth, audio transcription, image similarity). The architect (#249, "Add system-architect agent for capability + sensor gap detection") can detect sensor gaps but not tool gaps.
  2. Library ceiling — a repo might be hand-parsing LLM JSON outputs when pydantic-ai would do it in three lines. Today nothing in agent-os notices. As the library ecosystem shifts, agent-os ossifies on whatever was trendy the month a repo was onboarded.

The mitigation is the same anti-slop pattern as #249 and #250: the registry of known-good things is operator-owned; the detection of gaps against that registry is autonomous. Security hardening (pinned versions + checksums + no autonomous registry edits + no dynamic installs) means the blast radius of a compromised catalog entry is bounded and auditable. Without this, "extend agent capabilities" either stays a manual coding chore or opens a supply-chain door we cannot defend.
