Agentic Search

A retrieval-backed agent platform for multi-turn search, RAG, and RL training. Built around a FastAPI backend, interchangeable retrieval servers, and an async agent loop that supports dense/sparse hybrid retrieval, tool calling, and streaming chat.

🔍 Agentic RAG — Multi-hop retrieval with query decomposition, HyDE, hybrid reranking, and citation-grounded synthesis via AgenticRAGLoop.

🤖 Custom Agents — Compose agents from instructions, knowledge sources, tools, and memory; backed by SearchAgentLoop or ToolAgentLoop.

🌍 Web Search — Live retrieval via Google Custom Search, SerpAPI, and playwright-cli browser automation — all behind the same /retrieve API.

📚 Document Indexing — Chunk, embed, and index documents into FAISS or BM25; async background workers handle ingestion at scale.

🔗 Connectors — Pull content from local files, Google Drive, Slack, Confluence, GitHub, Jira, SharePoint, Salesforce, Zendesk, and Notion.

🛠️ Tool Use — Register Python callables or OpenAPI 3.x schemas as tools; ToolAgentLoop handles dispatch and structured output.

💬 Chat Orchestration — Streaming multi-turn chat with citation extraction, tool dispatch, context compression, and persisted sessions.

🧭 Intent Routing — Auto-classifies every query as search, chat, or tool and dispatches to the right agent loop with no configuration; tool mode adds RAG-Fusion multi-source aggregation.

🖥️ React Frontend — Streaming chat UI with live SSE progress log, Markdown rendering, [D1]-format citation anchor links, per-card source expand/collapse, tool call trace panel, and intent-adaptive layout.

🧠 RL Training — GRPO/PPO training with composite shaped rewards; SearchAgentGRPOTrainer runs real agent-loop rollouts so all reward components fire during training.

📐 Bamboogle Evaluation — Benchmark SearchAgentLoop on two-hop QA with exact-match, contains-match, and shaped reward metrics; Apple Silicon (--device mps) supported out of the box.

🔌 MCP Server — Expose search, retrieval, and RAG as Model Context Protocol tools so any MCP-compatible LLM client (Claude Desktop, etc.) can query your knowledge base directly.

📊 Admin & Observability — Health, analytics, rate limits, hooks, billing, SCIM provisioning, and license state via the FastAPI admin API.


Feature	Key modules
🔍 Agentic RAG	`src/agents/agentic_rag.py`, `src/context/query_enhancer.py`, `src/internal/servers/retrieval/hybrid_rerank.py`
🤖 Custom Agents	`src/agents/search.py`, `src/agents/custom.py`, `src/agents/tool_calling.py`, `src/agents/base.py`
🌍 Web Search	`src/internal/servers/web_search/google.py`, `src/internal/servers/web_search/serp.py`, `src/internal/servers/web_search/browser.py`
📚 Document Indexing	`src/internal/document_index/`, `src/internal/servers/backgroundworker/`
🔗 Connectors	`src/internal/connectors/`, `src/internal/servers/connectors/`, `src/internal/servers/oauth/`
🛠️ Tool Use	`src/tools/base.py`, `src/tools/api.py`, `src/tools/search.py`, `src/agents/tool_calling.py`
💬 Chat Orchestration	`src/internal/chat/process_message.py`, `src/internal/chat/llm_loop.py`, `src/internal/chat/citation_processor.py`, `src/internal/chat/compression.py`
🧭 Intent Routing	`src/internal/servers/web/app.py` (`_run_auto_routed`), `src/context/`
🖥️ React Frontend	`web/src/App.tsx`, `web/src/components/`, `web/src/styles.css`
🧠 RL Training	`src/training/reward.py`, `src/training/grpo.py`, `src/training/ppo/search_agent_grpo_trainer.py`
📐 Bamboogle Evaluation	`src/training/eval/bamboogle.py`, `examples/run_bamboogle_eval.py`, `bin/run_bamboogle_eval.sh`
🔌 MCP Server	`src/internal/mcp_server/tools/`, `src/internal/mcp_server/resources/`
📊 Admin & Observability	`src/internal/observability/`, `src/internal/servers/analytics/`, `src/internal/servers/reporting/`, `src/internal/servers/license/`
⚡ Retrieval Optimization	`src/internal/retrieval/query_optimizer.py`, `src/internal/retrieval/bm25_tuner.py`, `src/internal/retrieval/index_optimizer.py`, `src/internal/retrieval/fusion_learner.py`, `src/internal/retrieval/result_cache.py`
🏆 Reranking Optimization	`src/internal/retrieval/async_reranker.py`, `src/internal/retrieval/cached_reranker.py`, `src/internal/retrieval/two_stage_reranker.py`, `src/internal/retrieval/onnx_reranker.py`, `src/internal/retrieval/reranker_benchmark.py`

Repository Structure
Install · Quick Start · Frontend · Examples
Intent Routing · Features · Agentic RAG
Retrieval: Retrieval Setup · Neural Reranking · Retrieval Optimization · Query Transformation Optimization · Routing & Query Construction
HTTP APIs: Retrieval Server API · Web Backend API · Chat & Session API
Training & eval: Training · Evaluation
Ops: MCP Server · API Health Checks · Configuration · Tests · Notes

Repository Structure

src/
├── agents/                      # Agent loops (SearchAgentLoop, ToolAgentLoop, AgenticRAGLoop, …)
├── cli/                         # CLI query interface
├── context/                     # Retrieval-grounded context & prompt builders
├── model/                       # LLM generation, intent classifier, tensor helpers
├── shared_configs/              # Shared configuration dataclasses
├── tools/                       # Tool schemas, search tools, OpenAPI tool registry
├── training/
│   ├── eval/                    # Benchmark evaluation (Bamboogle, …)
│   ├── ppo/                     # PPO core, LLMGRPOTrainer, SearchAgentGRPOTrainer
│   ├── data.py                  # Training dataset builders
│   ├── grpo.py                  # GRPO advantage helpers
│   ├── reward.py                # SearchRewardFunction
│   └── sft.py                   # SFT data pipeline
└── internal/
    ├── access/                  # Access control & ACL helpers
    ├── auth/                    # Authentication & authorization
    ├── cache/                   # In-memory cache backend (chat session state)
    ├── chat/                    # Chat pipeline (loop, steps, citations, compression)
    ├── configs/                 # Environment-based configuration (AppSettings)
    ├── connectors/              # Data source connectors
    ├── context/                 # Internal retrieval context helpers
    ├── db/                      # SQLite store (AgenticSearchStore)
    ├── document_index/          # Document index (FAISS / BM25)
    ├── feature_flags/           # Feature-flag providers (env, PostHog, composite)
    ├── file_store/              # In-memory chat file handling
    ├── hooks/                   # Outbound webhook execution
    ├── llm/                     # LLM provider integrations
    ├── mcp_server/              # MCP server (tools, resources, auth)
    ├── metrics/                 # Metrics collection helpers
    ├── natural_language_processing/  # NLP utilities
    ├── observability/           # Admin surface summary & health score
    ├── prompts/                 # Prompt templates
    ├── retrieval/               # Retrieval core: service, fusion, query transforms, routers
    ├── routing/                 # Routing layer: per-query router + 6 query constructors
    ├── search/                  # Search-vs-chat flow classification
    ├── tools/                   # Internal tool registry
    ├── utils/                   # License, encryption, telemetry utilities
    └── servers/
        ├── admin_surface/       # Admin summary endpoint
        ├── analytics/           # Usage analytics API
        ├── backgroundworker/    # Async workers (beat, docfetching, light, heavy, monitoring)
        ├── billing/             # Stripe billing proxy
        ├── connectors/          # Connector management endpoints
        ├── documents/           # Connector-credential pair management
        ├── enterprise_settings/ # Enterprise configuration endpoints
        ├── evals/               # Evaluation endpoints
        ├── features/            # Feature-flag endpoints
        ├── indexing/            # Indexing status & control endpoints
        ├── license/             # License validation & seat management
        ├── limits/              # Usage limit enforcement
        ├── middleware/          # License enforcement, tier gate, tenant tracking
        ├── oauth/               # OAuth 2.0 connector authorization
        ├── query_and_chat/      # Search and chat endpoints
        ├── query_history/       # Query history & export
        ├── reporting/           # Usage report ZIP generation
        ├── retrieval/           # Dense/sparse/rerank server entry points
        ├── scim/                # SCIM 2.0 user & group provisioning
        ├── settings/            # Settings endpoints
        ├── tenants/             # Multi-tenant provisioning & management
        ├── token_rate_limits/   # Per-user token rate limiting
        ├── user_group/          # Group management
        ├── users/               # User management
        ├── web/                 # FastAPI app assembly
        └── web_search/          # Web search servers (Google, SerpAPI, browser)
bin/                             # Shell helpers (eval, training data generation)
tests/                           # Unit and integration test suites
examples/                        # Runnable CLI examples

The FastAPI app is assembled in src/internal/servers/web/app.py. Every feature area is a self-contained router factory. AgenticSearchStore (SQLite) is the single persistence layer — no Postgres, Redis, or Celery required locally.

Install

Requires Python 3.10+.

pip install -e .               # makes src importable as a package
pip install -r requirements.txt

For MCP server support:

pip install -e ".[mcp]"

For BM25 (pyserini), Java must be available on PATH. Set JAVA_HOME if needed.

Env vars — copy .env.example to .env (loaded automatically via python-dotenv):

# LLM provider (required for agent loops)
GEN_AI_MODEL_PROVIDER=openai       # openai | anthropic | ollama | litellm
GEN_AI_MODEL_VERSION=gpt-4o-mini
GEN_AI_API_KEY=...
GEN_AI_API_BASE=...                # optional override (e.g. http://localhost:11434/v1)

# Web search (pick one or more)
GOOGLE_API_KEY=...
GOOGLE_CSE_ID=...
SERP_API_KEY=...

# Optional
JAVA_HOME=/path/to/java            # for BM25 / pyserini

Quick Start

Three processes, each in its own terminal:

Retrieval service — http://localhost:8001

python3 -m src.internal.servers.retrieval.demo --corpus_path data/corpus.jsonl

Web API — http://localhost:7860

PYTHONPATH=src:. uvicorn src.internal.servers.web.app:app --host 127.0.0.1 --port 7860

Frontend — http://localhost:5173

cd web && npm install && npm run dev

Open http://127.0.0.1:5173. Vite proxies /api/* to the web API on port 7860. For production, npm run build produces web/dist; the FastAPI app serves it automatically.

Search Agent mode (optional — local MPS inference)

The UI has a fifth mode "Search Agent (Local Model)" that runs SearchAgentLoop in-process. To enable it, set SEARCH_AGENT_MODEL before starting the web API:

# 8 GB RAM
SEARCH_AGENT_MODEL=Qwen/Qwen2.5-0.5B-Instruct PYTHONPATH=src:. uvicorn src.internal.servers.web.app:app --host 127.0.0.1 --port 7860

# 16 GB RAM (better quality)
SEARCH_AGENT_MODEL=Qwen/Qwen2.5-1.5B-Instruct PYTHONPATH=src:. uvicorn src.internal.servers.web.app:app --host 127.0.0.1 --port 7860

Or use bin/run_web_stack.sh, which reads SEARCH_AGENT_MODEL from .env and starts all three processes in one command. Expect the first response to take ~30–60 s on MPS.

Frontend

The web/ directory contains a React 19 + Vite + TypeScript single-page app. It runs against the FastAPI backend at port 7860 and proxies /api/* through Vite in development.

cd web && npm install && npm run dev   # dev server at http://127.0.0.1:5173
cd web && npm run build                # production bundle → web/dist/ (served by FastAPI)
cd web && npm run typecheck            # TypeScript check
cd web && npm run test -- --run        # Vitest unit tests

UI features

Streaming answers (AnswerPanel.tsx → ProgressLog) — every query streams over SSE; streamAgent (web/src/api.ts) drives the UI from the progress / answer / done events (full schema in the SSE event table under Intent Routing). While the agent runs, a live Agent reasoning log renders one row per turn: ⟳ Turn N · writing answer… while the turn is active, and ✓ Turn N · <tool> · N docs once it completes. Answer tokens stream in as Markdown. On done, the log collapses to a one-line summary (✓ 3 turns) with a show reasoning ▸ toggle that re-expands the full trace. On the backend, each turn fires the on_turn callback (OnTurnCallback), which produces a progress event, while token / tool-call / citation packets originate from AgentQueueManager → Emitter. The New button (handleNewSession) aborts any in-flight request and clears answer / citations / documents / messages / intent; an in-flight turn can be cancelled via the stop-signal fence.

Markdown rendering — Answers render via react-markdown: headings, bold/italic, inline code, code blocks, and ordered/unordered lists. Citation markers ([D1], [D2], …) become anchor links that scroll the page to the matching source card.
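The citation rewrite itself lives in the React frontend, but the idea can be sketched in a few lines of Python. The helper name `link_citations` and the `#source-D<n>` fragment format are assumptions made for illustration; the README only says that markers such as [D1] become anchor links to the matching source card.

```python
import re

# Matches citation markers such as [D1] or [D12].
CITATION_RE = re.compile(r"\[D(\d+)\]")


def link_citations(markdown: str) -> str:
    """Rewrite [D<n>] markers as Markdown links to an in-page anchor.

    The "#source-D<n>" fragment is an assumed format, not the frontend's actual one.
    """
    return CITATION_RE.sub(
        lambda m: f"[[D{m.group(1)}]](#source-D{m.group(1)})", markdown
    )


print(link_citations("FAISS is a similarity-search library [D1][D2]."))
```

The doubled brackets keep the visible link text as `[D1]` rather than `D1`. The sketch is not idempotent: running it twice on the same text would wrap the markers again.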

Chat history — Session timeline renders as a chat bubble layout: user messages right-aligned, assistant messages left-aligned. System messages are filtered out. Keys are stable against message prepend/removal.

Source cards (SourceGrid.tsx) — SourceGrid is a thin mapper over a controlled SourceCard (memoised, per-document, owning its own expanded / copied state). Each card renders one SourceDocumentView ({ id, citation, title, content, url, score, metadata }) and:

- collapses content to 3 lines by default (source-content--clamped); show more ▾ / show less ▴ toggles per card.
- provides a ⎘ copy button that copies the full content and flips to "copied ✓" for 1.5 s.
- carries id="source-{citation}" so [D1]-style anchor links from the answer scroll to it.
- color-codes the relevance score via scoreColor() (green ≥ 0.7, amber ≥ 0.4, orange > 0, grey for 0).
- tags the source provider with a colored pill via SOURCE_COLORS (Browser Retrieval, SerpAPI, Local Retrieval, All Active Sources; grey fallback).
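The score thresholds listed for scoreColor() amount to a simple bucket lookup. The real function is TypeScript in the frontend; the Python sketch below only mirrors the stated thresholds, and the name `score_color` and the handling of negative scores are assumptions.

```python
def score_color(score: float) -> str:
    """Map a relevance score to a colour bucket using the thresholds listed above."""
    if score >= 0.7:
        return "green"
    if score >= 0.4:
        return "amber"
    if score > 0:
        return "orange"
    # A score of 0 is grey; negative scores also land here (an assumption).
    return "grey"


for value in (0.81, 0.55, 0.1, 0.0):
    print(value, score_color(value))
```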

Source cards are frontend-only (no dedicated backend endpoint): they are populated from the documents array of the POST /api/agent response (see Web Backend API); the retrieval server returns the same fields as results[] from POST /search. Inspect that backing data with:

curl -s -X POST http://localhost:7860/api/agent \
  -H "Content-Type: application/json" \
  -d '{"query": "What is FAISS?", "top_k": 3}' \
  | python3 -c "import sys, json; [print(d['citation'], round(d['score'],2), d['title']) for d in json.load(sys.stdin)['documents']]"
# → [D1] 0.81 FAISS: A Library for Efficient Similarity Search ...

Tool Call Trace Panel — When the agent runs in tool mode, a panel below the answer shows every tool call: name, status (✓ / ✗), arguments as JSON, result summary (first 200 chars or "N items" for lists), and latency in ms. Failed calls render with a red border and the error message.
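The result-summary rule (first 200 characters, or "N items" for lists) could look roughly like the sketch below. `summarize_result` is a hypothetical helper, and serialising non-string results as JSON before truncating is an assumption rather than documented behaviour.

```python
import json
from typing import Any


def summarize_result(result: Any, max_chars: int = 200) -> str:
    """Return "N items" for lists, else the first max_chars characters of the result."""
    if isinstance(result, list):
        return f"{len(result)} items"
    # Non-string results are rendered as JSON before truncation (an assumption).
    text = result if isinstance(result, str) else json.dumps(result, default=str)
    return text[:max_chars]


print(summarize_result([{"id": 1}, {"id": 2}, {"id": 3}]))
print(summarize_result("x" * 500)[:10], "...")
```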

Intent-adaptive layout — App.tsx reads response.intent (set from the done SSE event via setIntent) and applies intent-${intent} to the .results-layout container; when intent is undefined no class is added and the layout falls back to the default single-column stack. The behaviour is CSS-only — styles.css rules consume the class to reflow the existing panels (no extra components), keyed off stable hooks .answer-column, .sources-panel, .session-panel, and .tool-trace-panel:

Intent	`.results-layout` class	Layout
`search`	`intent-search`	Single column; `.sources-panel` gets a highlighted border; `.session-panel` dimmed
`chat`	`intent-chat`	`.answer-column` + `.session-panel` side-by-side (≥720 px); `.sources-panel` full-width below
`tool`	`intent-tool`	`.tool-trace-panel` full-width hero; `.sources-panel` and `.session-panel` side-by-side below
narrow (≤720 px)	—	All intents fall back to a single-column grid stack

The intent itself comes from the backend's routing decision — see the response.intent contract under Web Backend API. No new endpoints back this feature; the layout is a pure function of that one field.

Intent badge (AnswerPanel.tsx) — a pill under the answer summarising what ran, derived from response.intent + counts: Searched · 5 sources, Answered · 3 citations, or Used tools. Hidden when the answer is empty or the intent is undefined.

Example-query chips (SearchComposer.tsx) — three chips under the search box, one per routing intent, that populate and run a representative query in a single click so the intent router can be exercised without knowing what triggers each path: 🔍 find the onboarding checklist (search), 💬 explain how FAISS indexing works (chat), 🛠 summarize the latest sales figures and chart them (tool). The chips are hidden while a request is in flight.

Components (web/src/components/) — each panel is a focused, independently tested unit:

Component	What it does
`SearchComposer`	Single input box (no mode selector), per-intent example-query chips, source-provider / retrieval-URL / top-K controls, Cmd+Enter submit
`AnswerPanel`	Streamed markdown answer + intent badge + `[D1]` citation anchor links
`SourceGrid`	Expand/collapse source cards with copy-to-clipboard and citation `id` anchors
`SessionTimeline`	Chat-bubble history (user right, assistant left; system filtered)
`ToolCallTracePanel`	Per-tool-call trace (name, ✓/✗ status, JSON args, result summary, latency) for `tool` intent
`AdminOverview`	Single-call health snapshot — connectors, indexing, users, auth, models, tools, analytics with a composite health score
`AnalyticsDashboard`	Usage breakdowns by LLM, persona, and flow (`getAnalyticsBy*`)
`ConnectorPanel`	Lists configured connectors and their sync/index status
`QueryHistoryPanel`	Per-user query history with CSV export (`getQueryHistory`)
`ToolPanel`	Admin view of MCP/OpenAPI tools registered via `tool_registry`

API client functions live in web/src/api.ts: runAgent / streamAgent (SSE), createSession / getSession, getAdminSummary, getAnalyticsByLLM / getAnalyticsByPersona / getAnalyticsByFlow, getQueryHistory, getAuditSummary, submitFeedback.

Feedback loop (UI → fine-tuning) — submitFeedback(chatMessageId, isPositive, feedbackText?) posts per-message like/dislike to POST /chat/create-chat-message-feedback, and session thumbs go to POST /api/feedback; QueryHistoryPanel can filter sessions by feedback_type (like / dislike). These ratings are exactly what load_feedback_examples reads back into feedback-driven GRPO — the human-feedback signal that fine-tunes the policy.

Intent Routing

The backend auto-classifies every query and dispatches to the right agent without any configuration:

Intent	Agent loop	Trigger
`search`	`SearchAgentLoop`	Query needs external retrieval (web or indexed docs)
`chat`	`PlainGenerationLoop`	Conversational follow-ups, definitions, open-ended questions
`tool`	`ToolAgentLoop`	Explicit tool use (`search_routing_tool`, custom tools)

The router is _run_auto_routed in src/internal/servers/web/app.py. It runs an LLM-backed classifier (classify_is_search_flow) and falls back to chat on ambiguous input.
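A minimal sketch of that dispatch-with-fallback behaviour is shown below. It is not the code in `_run_auto_routed`: the `route_intent` helper, the dispatch dictionary, and the injected `classify` callable are illustrative stand-ins, and treating a classifier exception as ambiguous input is an assumption.

```python
from typing import Callable, Optional

# Loop names come from the intent table above; the dispatch dict itself is illustrative.
LOOP_BY_INTENT = {
    "search": "SearchAgentLoop",
    "chat": "PlainGenerationLoop",
    "tool": "ToolAgentLoop",
}


def route_intent(query: str, classify: Callable[[str], Optional[str]]) -> str:
    """Return the agent loop name for a query, falling back to chat.

    `classify` stands in for the LLM-backed classifier; it may return None or an
    unknown label when the input is ambiguous.
    """
    try:
        intent = classify(query)
    except Exception:
        intent = None  # a failing classifier is treated as ambiguous (assumption)
    return LOOP_BY_INTENT.get(intent, LOOP_BY_INTENT["chat"])


print(route_intent("find the onboarding checklist", lambda q: "search"))
print(route_intent("hmm", lambda q: None))
```

Passing the classifier in as a callable keeps the sketch testable without a live LLM.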

RAG-Fusion in tool mode — search_routing_tool aggregates results from all configured retrieval sources (local index, Google, SerpAPI) in a single call, deduplicates by URL, and returns a ranked list with [D1]/[D2] citation labels.
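The aggregate, deduplicate, rank, and label steps can be sketched as follows. This is not the implementation of search_routing_tool: `aggregate_sources` is a hypothetical helper, results are assumed to be dicts with `url` and `score` keys, and keeping the highest-scoring copy of a duplicate URL is an assumption. Scores from different sources may not be directly comparable in practice.

```python
from typing import Dict, Iterable, List


def aggregate_sources(result_sets: Iterable[List[dict]]) -> List[dict]:
    """Merge per-source result lists, dedupe by URL, rank by score, add [D<n>] labels."""
    best_by_url: Dict[str, dict] = {}
    for results in result_sets:
        for doc in results:
            url = doc["url"]
            # Keep the highest-scoring copy of each URL (an assumed tie-breaking rule).
            if url not in best_by_url or doc["score"] > best_by_url[url]["score"]:
                best_by_url[url] = doc
    ranked = sorted(best_by_url.values(), key=lambda d: d["score"], reverse=True)
    return [{**doc, "citation": f"[D{i}]"} for i, doc in enumerate(ranked, start=1)]


local = [{"url": "a", "score": 0.9}, {"url": "b", "score": 0.5}]
web = [{"url": "b", "score": 0.7}, {"url": "c", "score": 0.2}]
for doc in aggregate_sources([local, web]):
    print(doc["citation"], doc["url"], doc["score"])
```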

SSE streaming with progress events — All three agent paths emit SSE events:

Event type	When emitted	Payload
`progress`	Each agent turn	`{type, turn, text}`
`answer`	Answer token chunks	`{type, text}`
`done`	Stream complete	`{type, session_id, citations, documents, intent, tool_calls}`
`error`	Unhandled exception	`{type, detail}`

The on_turn callback (OnTurnCallback in src/agents/base.py) is the hook that feeds per-turn events into the SSE queue from inside the agent loop.
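As a rough sketch of how a per-turn hook can feed an SSE stream, the example below pushes `progress` events onto an `asyncio.Queue` and encodes each queued event as a `data:` frame. The callback arguments `(turn, tool_name, doc_count)` follow the OnTurnCallback description in the Features section; the queue wiring, the `sse_frame` helper, the progress text, and the use of plain `data:` frames are assumptions, not the backend's actual code.

```python
import asyncio
import json
from typing import Optional


def sse_frame(event: dict) -> str:
    """Encode one event as an SSE `data:` frame (a blank line terminates the frame)."""
    return f"data: {json.dumps(event)}\n\n"


async def demo() -> list:
    queue: asyncio.Queue = asyncio.Queue()

    async def on_turn(turn: int, tool_name: Optional[str], doc_count: int) -> None:
        # Per-turn hook: push a `progress` event for the streaming response to pick up.
        text = f"Turn {turn}: {tool_name or 'writing answer'}, {doc_count} docs"
        await queue.put({"type": "progress", "turn": turn, "text": text})

    # Stand-in for an agent loop calling the hook after each turn.
    await on_turn(1, "search", 3)
    await on_turn(2, None, 0)
    await queue.put({"type": "done", "session_id": "demo", "citations": [],
                     "documents": [], "intent": "search", "tool_calls": []})

    frames = []
    while True:
        event = await queue.get()
        frames.append(sse_frame(event))
        if event["type"] == "done":
            return frames


frames = asyncio.run(demo())
print("".join(frames))
```

In a real server the consumer side would be a streaming response generator rather than a list; the queue decouples the agent loop from the HTTP layer either way.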

Examples

Agent CLI

Mode	Loop	Needs retrieval server	Use it for
`single`	`PlainGenerationLoop`	No	Local generation smoke tests
`search`	`SearchAgentLoop`	Yes	Multi-turn RAG, SFT, and RL traces
`tool`	`ToolAgentLoop`	Yes	Structured tool-calling experiments

# single — no retrieval server needed (plain generation)
# Apple Silicon: use --device mps --allow_unsafe_mps for ~50x faster inference
python3 -m examples.run_agentic_search \
  --mode single --question "What is FAISS?" \
  --model Qwen/Qwen2.5-1.5B-Instruct --local --device mps --allow_unsafe_mps \
  --allow_remote_model_downloads

# single with retrieval server — small models (≤3B) use --mode single; search/tool require 7B+ to emit structured tags
python3 -m examples.run_agentic_search \
  --mode single --question "What is FAISS?" \
  --model Qwen/Qwen2.5-1.5B-Instruct --local --device mps --allow_unsafe_mps \
  --search_url http://localhost:8001/retrieve --allow_remote_model_downloads

# search — 3B is the Mac sweet spot (~6 GB unified memory); 7B needs 16 GB+ and will swap
python3 -m examples.run_agentic_search \
  --mode search --question "What is RAG?" \
  --model Qwen/Qwen2.5-3B-Instruct --local --device mps --allow_unsafe_mps \
  --search_url http://localhost:8001/retrieve --allow_remote_model_downloads

# search — server-backed, requires vLLM on :8080 and retrieval on :8001
python3 -m examples.run_agentic_search \
  --mode search --question "Compare dense and sparse retrieval" \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --vllm_url http://localhost:8080 --search_url http://localhost:8001/retrieve

Bamboogle evaluation (always requires retrieval server on :8001)

# Smoke test — local model, 1 example, full trace printed
python3 -m examples.run_bamboogle_eval \
  --model Qwen/Qwen2.5-3B-Instruct --local --device mps --allow_unsafe_mps \
  --search_url http://localhost:8001/retrieve --limit 1 --print_trace \
  --allow_remote_model_downloads

# Full benchmark — Apple Silicon, requires SERP_API_KEY in .env
bin/run_bamboogle_eval.sh --limit 125

PPO/GRPO reward

python3 -m examples.run_grpo_training_pipeline         # end-to-end reward + GRPO (no GPU)

Dataset preparation

# Search-QA parquet
python3 -m examples.prepare_search_qa_dataset \
  --dataset_name RUC-NLPIR/FlashRAG_datasets --dataset_config nq --local_dir data/nq_search

# Preview before writing
python3 -m examples.prepare_search_qa_dataset \
  --dataset_name RUC-NLPIR/FlashRAG_datasets --dataset_config nq \
  --splits test --max_examples 20 --preview --preview_rows 5

# RAG parquet from cached retrieval results
python3 -m examples.prepare_search_rag_dataset \
  --dataset_name RUC-NLPIR/FlashRAG_datasets --dataset_config nq \
  --corpus_path data/wiki-18.jsonl \
  --train_retrieval_cache data/nq_train_retrieval_cache.json \
  --test_retrieval_cache data/nq_test_retrieval_cache.json \
  --topk 3 --local_dir data/nq_rag

Search pipeline with access filters (no live model or retrieval server required)

python3 -m examples.run_search_pipeline

Features

Retrieval, Indexing & Search

Hybrid + rerank — dense (FAISS/E5) + sparse (BM25) RRF fusion with cross-encoder reranking in a single /retrieve endpoint
QueryEnhancer (src/context/query_enhancer.py) — base query-transformation primitives: decompose() (2–4 sub-queries), hyde() (hypothetical answer), step_back() (broader reformulation), and enhance() which runs all three into a QueryBundle. Every method is fallback-safe — it returns the original query / None when no LLM is configured
expand_keywords (src/internal/servers/secondary_llm_flows/query_expansion.py) — LLM keyword/synonym expansion for the BM25 leg; the QT_KEYWORDS branch of QueryTransformPipeline
Reranker (src/internal/retrieval/reranker.py) — unified neural reranker supporting local cross-encoders (BAAI/bge-reranker-v2-m3, cross-encoder/ms-marco-*) and Cohere v3/v4 API; built via Reranker.from_env(); injected into RetrievalService; skipped when RERANKER_PROVIDER is unset; appends +reranked to retrieval_mode
AsyncReranker (src/internal/retrieval/async_reranker.py) — wraps any reranker in a ThreadPoolExecutor; raises RerankerTimeoutError when RERANKER_TIMEOUT_MS is exceeded; exposes arerank() for async callers
CachedReranker (src/internal/retrieval/cached_reranker.py) — Redis-backed score cache keyed on sha256(query:sorted_doc_ids:k=top_k); stats() returns hits/misses/hit_rate; from_env() returns base unchanged when RERANKER_CACHE_REDIS_URL is unset
TwoStageReranker (src/internal/retrieval/two_stage_reranker.py) — fast pre-filter over all N candidates, heavy scorer over top M; both legs independently wrapped; enabled via RERANKER_TWO_STAGE=true
ONNXReranker (src/internal/retrieval/onnx_reranker.py) — drop-in replacement using optimum.onnxruntime; falls back to PyTorch Reranker on ImportError; enabled via RERANKER_USE_ONNX=true
PassageTruncator (src/internal/retrieval/passage_truncator.py) — whitespace-token truncation applied before scoring; zero-dependency; configurable via RERANKER_MAX_TOKENS (0 = disabled)
RerankerBenchmark (src/internal/retrieval/reranker_benchmark.py) — offline CLI grid search over model × batch_size × max_tokens; writes JSONL output and prints a ranked table sorted by NDCG@k
QueryTransformPipeline (src/context/query_transform.py) — composes decompose, HyDE, step-back, keyword expansion, and filter extraction behind one interface, producing a TransformedQueryBundle; bundle.retrieval_variants(max_variants) deduplicates the variants and always keeps the original query last, and RetrievalService retrieves each variant in parallel then fuses with rrf_fuse; all QT_* env vars default to false (zero overhead when disabled); appends +rag_fusion to retrieval_mode. Refactored to expose _build_jobs/_assemble and a per-query config_override, plus the module helper config_signature(), so the wrappers below can compose on top of the unchanged leaf
AsyncQueryTransformPipeline (src/internal/retrieval/async_query_transform.py) — runs the leaf's transform calls (decompose, HyDE, step-back, keywords, filter construction) concurrently in a ThreadPoolExecutor; a transform that exceeds QT_TRANSFORM_TIMEOUT_MS or raises degrades to its empty default rather than failing the request; enabled via QT_ASYNC=true
CachedQueryTransformPipeline (src/internal/retrieval/cached_query_transform.py) — Redis-backed bundle cache keyed on sha256(query|config_signature); caches the filter-free bundle and re-merges caller filters per call (no cross-caller leakage); stats() returns hits/misses/hit_rate; from_env() returns base unchanged when QT_CACHE_REDIS_URL is unset
MultiQueryGenerator (src/internal/retrieval/multi_query.py) — true Multi-Query retrieval: one LLM call produces N paraphrased reformulations (distinct from decompose's sub-questions); surfaced as the multi_query field on TransformedQueryBundle; enabled via QT_MULTI_QUERY=true (QT_MULTI_QUERY_N controls N)
variant_weighted_rrf_fuse / dedup_variants (src/internal/retrieval/fusion.py) — weighted RAG-Fusion across N variant result sets (original query weighted highest) gated by QT_FUSION_WEIGHTED; embedding-cosine dedup that drops near-duplicate variants before retrieval gated by QT_SEMANTIC_DEDUP (dormant until a dense backend exposes a batch embed())
QueryRouter / RoutedQueryTransformPipeline (src/internal/retrieval/query_router.py, routed_query_transform.py) — per-query learned routing: predicts which transforms to enable from a serialized scikit-learn artifact (QT_ROUTER_MODEL_PATH) with a rule-based heuristic fallback when no artifact is present; the wrapper threads the predicted config down the chain as config_override; enabled via QT_ROUTER=true. Train the artifact offline with src/training/train_query_router.py
build_query_transform_pipeline_from_env (src/internal/retrieval/query_transform_factory.py) — composes the active layers RoutedQueryTransformPipeline → CachedQueryTransformPipeline → AsyncQueryTransformPipeline → QueryTransformPipeline, skipping any whose flag is unset; returns None (single-query path, zero overhead) when no QT_* flag is set
QueryTransformBenchmark (src/internal/retrieval/query_transform_benchmark.py) — offline grid over technique-combination configs × a labeled dataset; run_query_transform_benchmark() reports recall@k / NDCG@k (reusing eval_metrics) plus mean transform latency per config, sorted by recall
qt_slo_exceeded (src/internal/retrieval/eval_runner.py) — P99 transform-latency gate; eval_runner --qt-slo-ms N records per-query qt_latency_ms and exits non-zero when P99 exceeds the budget
QueryConstructor (src/internal/retrieval/query_constructor.py) — NL → metadata filter extraction; with QT_CONSTRUCT_OPERATORS=true it additionally emits numeric comparison filters (rating_gte, rating_lte) beyond equality and date ranges
QueryOptimizer (src/internal/retrieval/query_optimizer.py) — acronym expansion (data/query/acronyms.json), WordNet synonym injection, and symspellpy spell correction applied to the BM25 leg only; enabled via QUERY_EXPANSION_ENABLED / SPELL_CORRECTION_ENABLED
BM25Tuner (src/internal/retrieval/bm25_tuner.py) — grid search over (k1, b) against labeled QA pairs; results written to data/eval/bm25_params.json; BM25+ variant (δ=1.0) enabled via BM25_VARIANT=bm25plus
FAISSIndexBuilder (src/internal/retrieval/index_optimizer.py) — builds IVF-PQ indexes (nlist=4096, m=96, nbits=8, nprobe=64) cutting memory from ~30 GB to ≤ 4 GB at 10 M docs; HNSWTuner finds minimum ef_search meeting a recall target; EmbeddingBatcher coalesces concurrent embed calls within a 5ms window
FusionLearner (src/internal/retrieval/fusion_learner.py) — fits per-source RRF weights (w_sparse, w_dense) offline; loaded at startup from FUSION_WEIGHTS_PATH; falls back to uniform weights when absent; adaptive_mmr_lambda selects λ by query length when ADAPTIVE_MMR=true
ResultCache (src/internal/retrieval/result_cache.py) — Redis-backed full SearchResponse cache keyed on canonicalized query + filters + top_k; TTL via RESULT_CACHE_TTL; hit/miss stats surfaced via GET /api/admin/retrieval/stats
graph_rag_search (src/internal/retrieval/graph_rag.py) — GraphRAG retrieval: extract_entities + build_entity_graph build an EntityGraph over the top retrieved passages, then re-rank by entity connectivity; served by POST /internal/search/graph (retrieval_mode: "graph")
CachedEmbedder / EmbeddingBatcher (src/internal/retrieval/embedding_cache.py) — query-embedding cache keyed on sha256(query) plus a batcher that coalesces concurrent embed calls within a short window to cut redundant encoder passes
RetrievalService (src/internal/retrieval/service.py) — the retrieval core behind the HTTP server: composes sparse + dense backends → RRF fusion → MMR → optional reranker → optional query-transform pipeline; from_env() wires every optimization layer from QT_* / RERANKER_* / cache env vars; exposes search(query, top_k, filters) returning (results, retrieval_mode)
Offline evaluation — run_beir_eval (beir_eval.py) scores BM25/dense/hybrid against BEIR datasets (NDCG/MRR/Recall); run_ragas_eval (ragas_eval.py) scores end-to-end RAG answers (faithfulness, answer/context relevancy) via build_ragas_dataset; eval_runner.py is the CI gate (NDCG/MRR/MAP + latency SLO)
Local dense retrieval with FAISS-compatible indexes (E5, BGE, custom embedders)
Local sparse retrieval with BM25/Pyserini
Web search via Google Custom Search, SerpAPI, and playwright-cli
FAISS and BM25 index builders from a JSONL corpus (src/internal/document_index/index_builder.py)
Background indexing pipeline — async workers fetch, parse, chunk, enrich, embed, and index; supports mini-chunks, vector-write retries, and document prefiltering
Connectors (src/internal/connectors/) — collect documents from multiple sources:
- LocalFileConnector / LocalFilePollConnector — UTF-8 files from paths, directories, or globs
- SearchConnector — search results as documents via retrieval, Google, or SerpAPI
- WebConnector / RSSConnector — web page scraping and RSS feed ingestion
- InMemoryConnector — Python objects for testing and prototyping
- OAuthConnector — base class for authorization-code OAuth flows (Google Drive, Slack, Confluence, GitHub, Jira, SharePoint, Salesforce, Zendesk, Notion)
- PollConnector / CheckpointedConnector / SlimConnector — base classes for incremental sync with time-window, checkpoint, and permission-metadata variants
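Several items in this list rely on reciprocal rank fusion (RRF): the hybrid dense + sparse path, the per-variant fusion in RetrievalService, and the per-source weights fitted by FusionLearner. A minimal sketch of the standard formula, score(d) = Σ w_i / (k + rank_i(d)), is given below. The function name `rrf_fuse_sketch` is hypothetical and k = 60 is the commonly used default, not a value confirmed for this repository's rrf_fuse.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Sequence


def rrf_fuse_sketch(
    rankings: Sequence[List[str]],
    weights: Optional[Sequence[float]] = None,
    k: int = 60,
) -> List[str]:
    """Fuse ranked doc-id lists: score(d) = sum_i w_i / (k + rank_i(d)), ranks from 1."""
    weights = list(weights) if weights is not None else [1.0] * len(rankings)
    scores: Dict[str, float] = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (k + rank)
    # Sort by fused score, highest first; ties are broken by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


sparse = ["d1", "d2", "d3"]
dense = ["d3", "d1", "d4"]
print(rrf_fuse_sketch([sparse, dense]))  # ['d1', 'd3', 'd2', 'd4']
```

With uniform weights, d1 wins because it ranks first and second (1/61 + 1/62), narrowly ahead of d3 (1/63 + 1/61). Passing `weights=[1.0, 3.0]` favours the dense list enough to move d3 to the top.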

Agent Loops

Agentic RAG (AgenticRAGLoop) — multi-hop query decomposition, HyDE, iterative retrieval with evidence sufficiency gating, and grounded synthesis with citations
Multi-turn SearchAgentLoop traces with <think>, <search>, <information>, and <answer> actions
ToolAgentLoop — generic tool-calling loop usable from both search and chat flows; emits action_trace (newline-delimited JSON of every ToolExecutionResult) for downstream parsing and display
OnTurnCallback — async hook called after each agent turn with (turn, tool_name, doc_count); wired through SearchAgentLoop, ToolAgentLoop, and PlainGenerationLoop; used by the web backend to forward live progress events over SSE
BaseAgent (src/agents/graph_base.py) — Pydantic-based agent base class; lightweight alternative to LangGraph for custom agent workflows with invoke()-compatible interface
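
To make the SearchAgentLoop trace format concrete, here is a minimal parsing sketch. It assumes each action is wrapped in a matching closing tag (for example <search>…</search>), which the list above does not state explicitly; parse_trace and the sample trace are illustrative names, not part of the codebase.

```python
import re
from typing import List, Tuple

# Assumption: every action has a matching closing tag of the same name.
ACTION_RE = re.compile(r"<(think|search|information|answer)>(.*?)</\1>", re.DOTALL)


def parse_trace(trace: str) -> List[Tuple[str, str]]:
    """Return (action, payload) pairs in the order they appear in the trace."""
    return [(m.group(1), m.group(2).strip()) for m in ACTION_RE.finditer(trace)]


trace = (
    "<think>I should look up what the index stores.</think>"
    "<search>what does a FAISS index store</search>"
    "<information>A FAISS index stores dense vectors for similarity search.</information>"
    "<answer>Dense vectors.</answer>"
)
actions = parse_trace(trace)
for name, payload in actions:
    print(f"{name}: {payload}")
```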

LLM Backends

OpenAICompatibleLLM — single client for OpenAI, Azure OpenAI, Anthropic, Ollama, LiteLLM, and vLLM (src/internal/llm/providers.py)
Server-backed inference via any OpenAI-compatible endpoint (--vllm_url)
In-process HuggingFace models on CPU, CUDA, or MPS (--local --device)
Configured via GEN_AI_MODEL_PROVIDER, GEN_AI_MODEL_VERSION, GEN_AI_API_KEY, GEN_AI_API_BASE

Tool Use

Hermes, Llama-3, and JSON tool-call parsers
ApiToolRegistry — load and execute tools from any OpenAPI 3.x schema at runtime
FunctionTool — wrap any Python callable with auto-generated JSON schema
build_search_tool — ready-made tool dispatching to retrieval, Google, or SerpAPI
ToolCallView (src/internal/servers/web/app.py) — response model for each tool call: tool_name, status, arguments (dict), result_summary (first 200 chars or "N items"), latency_ms, error; returned as AgentExperienceResponse.tool_calls for intent == "tool" requests
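
As a rough illustration of what "auto-generated JSON schema" for a Python callable can look like, the sketch below derives a schema from a function signature with the standard inspect module. The function_to_schema helper, the type mapping, and the lookup_order example are assumptions made for this example; FunctionTool's real output format may differ.

```python
import inspect
from typing import Any, Callable, Dict, List

# Assumption: only a few primitive annotations are mapped; anything else becomes "string".
_PY_TO_JSON = {str: "string", int: "integer", float: "number", bool: "boolean"}


def function_to_schema(fn: Callable[..., Any]) -> Dict[str, Any]:
    """Build a JSON-schema-style tool description from a callable's signature."""
    properties: Dict[str, Any] = {}
    required: List[str] = []
    for name, param in inspect.signature(fn).parameters.items():
        properties[name] = {"type": _PY_TO_JSON.get(param.annotation, "string")}
        if param.default is inspect.Parameter.empty:
            required.append(name)  # parameters without defaults are required
    return {
        "name": fn.__name__,
        "description": inspect.getdoc(fn) or "",
        "parameters": {"type": "object", "properties": properties, "required": required},
    }


def lookup_order(order_id: str, include_items: bool = False) -> str:
    """Return a short status line for an order."""
    return f"order {order_id}"


schema = function_to_schema(lookup_order)
print(schema)
```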

Chat Processing

build_chat_turn — top-level orchestrator: resolves persona, tools, files, and LLM; dispatches to run_llm_loop; persists via save_chat_turn (src/internal/chat/process_message.py)
run_llm_loop — multi-turn loop: message history, tool dispatch, context injection, token streaming
run_llm_step — single LLM step: prompt → stream → extract tool calls → LlmStepResult
DynamicCitationProcessor — streams tokens and extracts citation markers in REMOVE / KEEP / HYPERLINK modes
compress_chat_history — summarises older turns when context exceeds the token budget; branch-aware
Emitter — routes packets (tokens, tool calls, citations) from worker threads to the HTTP stream
build_system_prompt — assembles system prompt from persona, tools, knowledge, and memory context
AgentQueueManager (src/internal/chat/queue_manager.py) — thread-safe queue that funnels AgentThought packets (token deltas, tool calls, citations, QueueEvent markers) from worker threads to the SSE stream; the backbone of streamed chat
ChatStateContainer / ChatTurnSetup / AvailableFiles (src/internal/chat/chat_state.py) — per-turn chat state: resolved persona, tools, uploaded files, and message history assembled once per turn
maybe_emit_argument_delta + Parser (src/internal/chat/tool_call_args_streaming.py) — incrementally parses and streams tool-call argument deltas so tool inputs render live as the model emits them
Stop / cancel signalling (src/internal/chat/stop_signal_checker.py) — set_fence / is_connected / reset_cancel_status use a Redis fence keyed by session to abort an in-flight turn when the client disconnects or hits Stop
compress_chat_history token-budget policy is documented in src/internal/chat/COMPRESSION.md
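
The three citation modes named for DynamicCitationProcessor can be sketched on a complete string (the real processor works on a token stream). This is a simplified illustration under assumptions: process_citations is a hypothetical helper, markers follow the [D1] format used elsewhere in this README, and the Markdown link shape in HYPERLINK mode is a guess.

```python
import re
from typing import Dict, List, Tuple

CITATION_RE = re.compile(r"\[D(\d+)\]")


def process_citations(text: str, mode: str, urls: Dict[int, str]) -> Tuple[str, List[int]]:
    """Return (processed_text, cited_doc_numbers) for mode REMOVE, KEEP, or HYPERLINK."""
    cited: List[int] = []

    def _sub(match):
        num = int(match.group(1))
        if num not in cited:
            cited.append(num)
        if mode == "REMOVE":
            return ""
        if mode == "HYPERLINK" and num in urls:
            return f"[[D{num}]]({urls[num]})"  # assumed link format
        return match.group(0)  # KEEP, or HYPERLINK without a known URL

    out = CITATION_RE.sub(_sub, text)
    if mode == "REMOVE":
        out = re.sub(r"\s+([.,;:])", r"\1", out)  # drop the gap left before punctuation
        out = re.sub(r" {2,}", " ", out).strip()
    return out, cited


text = "FAISS supports ANN search [D1] and GPU indexes [D2]."
print(process_citations(text, "REMOVE", {}))
print(process_citations(text, "HYPERLINK", {1: "https://example.com/a"}))
```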

Cache & Persistence

AgenticSearchStore (SQLite) — connectors, documents, permissions, chat sessions, indexing attempts, usage reports, rate limits, SCIM tokens, standard answers (src/internal/db/store.py)
Search history per user (GET /search/search-history) and query history with CSV export (GET /admin/query-history/export)
InMemoryCache — in-flight chat session state (processing flag, stop signal, cancel) during streaming
ChunkBatchStore — temp disk buffer decoupling embedding from index insertion for large jobs (src/internal/servers/indexing/chunk_batch_store.py)
InMemoryChatFile — uploaded files (images, PDFs, text) held in memory for one chat turn

Prompts

Chat prompt constants — citation reminders, system prompt defaults, file/image/tool templates (src/internal/prompts/chat_prompts.py)
KEYWORD_EXPANSION_PROMPT / QUERY_TYPE_PROMPT — broaden sparse queries and classify intent for retrieval tuning
Binary search/chat classification prompt with labelled examples and strict single-word output
Agentic RAG prompts — decompose (2–4 sub-questions) and HyDE (hypothetical ideal answer) for QueryEnhancer
build_search_agent_instruction — assembles the ReAct-style system prompt for SearchAgentLoop (src/agents/search.py)

RL Training

Composite reward shaping (SearchRewardFunction) — correctness, format compliance, citation support, unnecessary-fetch penalty, and fetch-usefulness reward components
Group-relative advantage helpers for PPO, GRPO, and REINFORCE-style experiments
PPO core: clipped policy loss, value loss, entropy, KL penalty, adaptive and fixed KL controllers
LLMGRPOTrainer — online GRPO for any HuggingFace causal-LM; rolls out G completions per prompt, scores with judge_fn + SearchRewardFunction, and updates with PPO-clip + KL penalty (src/training/ppo/llm_grpo_trainer.py)
SearchAgentGRPOTrainer — extends LLMGRPOTrainer with real SearchAgentLoop rollouts to unlock the full shaped-reward signal (citations, search quality, fetch usefulness) (src/training/ppo/search_agent_grpo_trainer.py)
Feedback-driven GRPO — load_feedback_examples(db_path, min_ratings=10) (src/training/data.py) reads thumbs-up/down sessions from AgenticSearchStore (the retrieval_feedback table fed by POST /api/feedback) into PromptTrainingExamples with metadata["human_signal"] = +1.0 / -1.0. SearchRewardFunction adds a human_feedback reward component weighted by SearchRewardConfig.human_feedback_weight (default 0.0 → zero regression on existing presets); SearchAgentGRPOTrainer threads human_signal from batch metadata into the score. Closes the loop: user feedback → reward signal → policy update
SFT warm-start (src/training/sft.py) — SFTTrainer / SFTConfig (epochs=3, lr=2e-5) supervised-fine-tune a base model on agent traces before GRPO, so RL starts from a competent policy rather than cold-exploring. load_sft_examples(db_path, jsonl_path=None, min_ratings=1) (src/training/data.py) merges thumbs-up sessions from AgenticSearchStore with optional JSONL pairs ({"question", "response"}) into list[SFTExample] (built via build_search_sft_example). Loss is cross-entropy on assistant tokens only — system / user / tool-result tokens are masked to -100 so the model imitates only the agent's own actions. Two-phase via examples/run_sft_grpo.py: Phase 1 SFT → intermediate checkpoint (data/checkpoints/sft_warmstart/) → Phase 2 GRPO loads it with SearchAgentGRPOTrainer.from_pretrained(...); --sft_epochs 0 skips straight to GRPO with no code-path change
Training data builders for search-QA and RAG parquet datasets (src/training/data.py)
bin/generate_training_data.sh — one-command parquet generation for Bamboogle, NQ, TriviaQA, and HotpotQA; --preview mode prints sample records without writing
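
For readers new to GRPO, the group-relative advantage and the PPO-clip term referenced above can be written in a few lines of plain Python. This is a generic textbook sketch, not the code in src/training/ppo/; the function names, the eps stabilizer, and the 0.2 clip range are assumptions.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each rollout's reward against the other rollouts for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_policy_term(ratio: float, advantage: float, clip: float = 0.2) -> float:
    """PPO-clip surrogate for one sample: min(r * A, clip(r, 1 - c, 1 + c) * A)."""
    clipped_ratio = max(1.0 - clip, min(1.0 + clip, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


# G = 4 completions for one prompt, each scored by a reward function.
rewards = [1.0, 0.0, 0.5, 0.5]
advs = group_relative_advantages(rewards)
print([round(a, 3) for a in advs])
print(clipped_policy_term(ratio=1.5, advantage=1.0))
```

Because advantages are centered within the group, a prompt whose rollouts all score the same contributes no gradient signal, which is one reason shaped reward components matter for this kind of training.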

Intent Routing & Query Transformations

Auto-routing (_run_auto_routed) — single entry point that classifies every query as search, chat, or tool and dispatches to the right agent loop; no per-query configuration needed
RAG-Fusion — search_routing_tool in tool mode aggregates across all configured retrieval sources, deduplicates by URL, and returns [D1]/[D2]-labelled results
Query decomposition (QueryEnhancer.decompose) — splits complex questions into 2–4 independent sub-queries for parallel retrieval
HyDE (QueryEnhancer.hyde) — generates a hypothetical ideal answer to expand sparse queries before retrieval
Step-back prompting — reformulates narrow questions into broader conceptual queries
Multi-Query retrieval (MultiQueryGenerator) — one LLM call yields N paraphrased reformulations retrieved in parallel and fused, distinct from decomposition's sub-questions
Weighted RAG-Fusion (variant_weighted_rrf_fuse) — RRF across all variant result sets with the original query weighted highest; optional pre-retrieval semantic dedup of near-duplicate variants
Canonical query rewrite (QueryEnhancer.rewrite, QT_REWRITE) — one normalized rewrite that fixes typos and strips verbosity while preserving meaning, distinct from step-back's broadening; threaded through the bundle/router as the 7th transform label
Learned query routing (QueryRouter) — predicts the per-query transform set (7 labels: decompose, hyde, step_back, keywords, construct_filters, multi_query, rewrite) from a scikit-learn artifact with a rule-based heuristic fallback, so cheap/keyword queries skip expensive transforms
Keyword extraction — strips conversational noise from queries before BM25 retrieval
Search vs chat (classify_is_search_flow) — LLM-backed binary router; defaults to chat on ambiguous input (src/internal/servers/secondary_llm_flows/search_flow_classification.py)
Intent classifier (IntentPipeline) — trainable feedforward ML model classifying purchase / navigate / qa / recommendation; selects fast / balanced / reasoning model tier (src/model/intent_classifier.py)
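
Weighted RAG-Fusion is easiest to see on a toy example. The sketch below applies reciprocal rank fusion across the original query's ranking and two variant rankings, with the original weighted highest. It is a generic illustration: weighted_rrf_fuse, the weights, and k=60 are assumptions and do not describe the signature of variant_weighted_rrf_fuse.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def weighted_rrf_fuse(
    ranked_lists: Sequence[Sequence[str]],
    weights: Sequence[float],
    k: int = 60,
) -> List[str]:
    """Fuse rankings with score(doc) = sum over lists of weight / (k + rank)."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking, weight in zip(ranked_lists, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (k + rank)
    # Sort by descending score; break ties by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


original = ["d1", "d2", "d3"]    # results for the user's query
variant_a = ["d2", "d4", "d1"]   # results for one paraphrase
variant_b = ["d5", "d2", "d1"]   # results for another paraphrase
fused = weighted_rrf_fuse([original, variant_a, variant_b], weights=[2.0, 1.0, 1.0])
print(fused)
```

Here d2 edges out d1 because it ranks near the top of every list, while d3 still beats d4 and d5 because it comes from the double-weighted original query.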

Routing Layer & Query Construction (src/internal/routing/, default-off behind ROUTING_ENABLED)

Per-query Router (router.py) — routes each query to a domain → source(s) → retriever target, emitting a RouteDecision. Heuristic strategy by default (no LLM); optional logical (LLM structured-classification) and semantic (embedding-similarity over route descriptions) strategies, each falling back to the heuristic. Backed by a config-driven RouteRegistry (ROUTING_REGISTRY_PATH) so domains aren't hardcoded
Six query constructors behind one construct(query, route) -> ConstructedQuery interface (construction/): Metadata Filter (wraps QueryConstructor), Vector Search params, Hybrid fusion config (reuses adaptive_mmr_lambda); plus net-new SQL (schema-aware Text-to-SQL, SELECT-only + table allowlist), Knowledge Graph (read-only Cypher templating, word-boundary write-clause rejection), and API Request (NL → allowlisted request params)
Construct-only safety — the three net-new constructors build and validate a query but never execute it (no SQL/KG/API backend); RetrievalService short-circuits those targets to empty results, so routing to them never touches a live system. Every route()/construct() is fallback-safe (degrades, never raises)
Routing-accuracy gate (routing_accuracy, eval_runner --routing-eval) — scores the router's top-1 retriever pick against a labeled data/eval/routing_labels.jsonl

Observability & Feature Flags

build_admin_surface_summary — single-call health snapshot: connectors, indexing, users, auth, models, tools, analytics, enterprise controls with a composite health score
MonitoringWorker — background poller for process memory (RSS), index queue depth, connector count; ships JSON snapshots to a cloud data-plane URL
event_telemetry / identify_user — PostHog event capture helpers; no-ops when PostHog is not configured
Feature flags — composable chain: EnvFeatureFlagProvider → PostHogFeatureFlagProvider; StaticFeatureFlagProvider for tests; single call-site via is_feature_enabled

Agentic RAG

chat_loop is the web API name for AgenticRAGLoop — web modes are named by session behavior, not retrieval strategy. Valid modes: search_tool, hybrid_search, chat_once, chat_loop.

curl -X POST http://localhost:7860/api/agent \
  -H "Content-Type: application/json" \
  -d '{"query": "What is FAISS?", "mode": "chat_loop", "top_k": 5}'

Loop flow:

Query enhancement — decompose into sub-queries; generate HyDE hypothetical answer
Hybrid+rerank retrieval — retrieve per enhanced query; accumulate unique documents
Sufficiency check — LLM judges if context is enough; break or continue
Follow-up generation — LLM proposes targeted follow-up queries if insufficient
Grounded synthesis — answer from all accumulated evidence with inline citations
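
The five steps above can be sketched as a small control loop. This is not AgenticRAGLoop itself: the agentic_rag_loop function, its callable parameters, max_hops, and the two-document corpus are all hypothetical, and the LLM-backed steps are replaced by deterministic stubs so the flow can be run as-is.

```python
from typing import Callable, Dict, List, Tuple

Doc = Tuple[str, str]  # (doc_id, text)


def agentic_rag_loop(
    query: str,
    enhance: Callable[[str], List[str]],
    retrieve: Callable[[str], List[Doc]],
    is_sufficient: Callable[[str, Dict[str, str]], bool],
    follow_ups: Callable[[str, Dict[str, str]], List[str]],
    synthesize: Callable[[str, Dict[str, str]], str],
    max_hops: int = 3,
) -> str:
    evidence: Dict[str, str] = {}
    pending = enhance(query)                       # 1. query enhancement
    for _ in range(max_hops):
        for sub_query in pending:                  # 2. retrieve per enhanced query
            for doc_id, text in retrieve(sub_query):
                evidence.setdefault(doc_id, text)  # keep unique documents only
        if is_sufficient(query, evidence):         # 3. sufficiency check
            break
        pending = follow_ups(query, evidence)      # 4. follow-up generation
        if not pending:
            break
    return synthesize(query, evidence)             # 5. grounded synthesis


CORPUS = {
    "D1": "FAISS is a library for similarity search over dense vectors.",
    "D2": "BM25 is a sparse lexical ranking function.",
}

answer = agentic_rag_loop(
    "Compare FAISS and BM25",
    enhance=lambda q: ["FAISS"],  # stub for decomposition + HyDE
    retrieve=lambda q: [(d, t) for d, t in CORPUS.items() if q.lower() in t.lower()],
    is_sufficient=lambda q, ev: len(ev) >= 2,
    follow_ups=lambda q, ev: [] if "D2" in ev else ["BM25"],
    synthesize=lambda q, ev: " ".join(f"{t} [{d}]" for d, t in sorted(ev.items())),
)
print(answer)
```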

Retrieval Setup

src.internal.document_index is the single indexing entry point — filtering, chunking, embedding, retry-isolated writes, and failure reporting. Query-time retrievers and the retrieval HTTP client live in src.context. Reranker utilities live in src.internal.servers.retrieval.

Retrieval servers (src/internal/servers/retrieval/):

Module	Description
`demo.py`	TF-IDF over corpus.jsonl — no Java required
`retrieval_server.py`	BM25 or dense (E5/BGE via FAISS)
`retrieval_rerank.py`	Retrieval + cross-encoder reranker
`rerank.py`	Standalone cross-encoder reranker (no retrieval)
`hybrid_rerank.py`	Dense + BM25 RRF fusion + rerank (recommended for `AgenticRAGLoop`)

Web search servers (src/internal/servers/web_search/):

Module	Description
`google.py`	Google Custom Search proxy
`serp.py`	SerpAPI proxy
`browser.py`	playwright-cli browser automation; no API key, ~5–10s/query

Start a retrieval server:

# Dense (E5)
python3 -m src.internal.servers.retrieval.retrieval_server \
  --model_path intfloat/e5-base-v2 --index_path data/indexes/e5_Flat.index \
  --corpus_path data/corpus.jsonl --retrieval_method e5 --device cpu --topk 5

# Sparse BM25
python3 -m src.internal.servers.retrieval.retrieval_server \
  --index_path data/indexes/bm25 --corpus_path data/corpus.jsonl --retrieval_method bm25

Build indexes:

python3 -m src.internal.document_index.index_builder \
  --retrieval_method e5 --model_path intfloat/e5-base-v2 \
  --corpus_path data/corpus.jsonl --faiss_type Flat --save_dir data/indexes/

python3 -m src.internal.document_index.index_builder \
  --retrieval_method bm25 --corpus_path data/corpus.jsonl --save_dir data/indexes/

Hybrid + rerank:

python3 -m src.internal.servers.retrieval.hybrid_rerank \
  --dense_model intfloat/e5-base-v2 --index_path data/indexes/e5_Flat.index \
  --corpus_path data/corpus.jsonl \
  --sparse_index_path data/indexes/bm25 --hybrid_alpha 0.5 \
  --retrieval_topk 10 --rerank_topk 5

Web search servers:

python3 -m src.internal.servers.web_search.serp \
  --search_url "https://serpapi.com/search" --topk 3 --serp_api_key "$SERP_API_KEY"

python3 -m src.internal.servers.web_search.google \
  --api_key "$GOOGLE_API_KEY" --topk 5 --cse_id "$GOOGLE_CSE_ID" --snippet_only

Health check:

curl -i -sS http://127.0.0.1:8001/health
curl -i -sS -X POST http://127.0.0.1:8001/retrieve \
  -H "Content-Type: application/json" -d '{"query":"What is FAISS?","topk":5}'

Neural Reranking

RetrievalService optionally reranks hybrid-fused results via a layered wrapper chain. Set RERANKER_PROVIDER to enable; all wrappers are opt-in via env vars and compose on top of the unchanged Reranker leaf.

Wrapper chain (outermost → innermost):

TwoStageReranker → CachedReranker → AsyncReranker → Reranker (leaf)
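
The idea of composing wrappers around an unchanged leaf can be shown with a toy example. The classes below (OverlapReranker, CachingWrapper) and the rerank(query, docs, top_k) method are hypothetical stand-ins, not the project's Reranker or CachedReranker; they only demonstrate that each wrapper exposes the same interface as the object it wraps.

```python
from typing import Dict, List, Tuple


class OverlapReranker:
    """Toy leaf (assumption): ranks documents by tokens shared with the query."""

    def __init__(self) -> None:
        self.calls = 0

    def rerank(self, query: str, docs: List[str], top_k: int) -> List[str]:
        self.calls += 1
        q_tokens = set(query.lower().split())
        # sorted() is stable, so ties keep their input order.
        ranked = sorted(docs, key=lambda d: -len(q_tokens & set(d.lower().split())))
        return ranked[:top_k]


class CachingWrapper:
    """Wrapper with the same rerank() interface; memoizes results in memory."""

    def __init__(self, inner) -> None:
        self._inner = inner
        self._cache: Dict[Tuple[str, Tuple[str, ...], int], List[str]] = {}

    def rerank(self, query: str, docs: List[str], top_k: int) -> List[str]:
        key = (query, tuple(docs), top_k)
        if key not in self._cache:
            self._cache[key] = self._inner.rerank(query, docs, top_k)
        return list(self._cache[key])


leaf = OverlapReranker()
reranker = CachingWrapper(leaf)  # further wrappers would stack the same way
docs = [
    "cats are mammals",
    "faiss is a similarity search library",
    "search engines index pages",
]
top = reranker.rerank("similarity search library", docs, top_k=2)
reranker.rerank("similarity search library", docs, top_k=2)  # cache hit
print(top, leaf.calls)
```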

Enable local BGE reranking:

RERANKER_PROVIDER=local RERANKER_MODEL=BAAI/bge-reranker-v2-m3 \
  PYTHONPATH=src:. uvicorn src.internal.servers.web.app:app --host 127.0.0.1 --port 7860

Enable Cohere reranking:

RERANKER_PROVIDER=cohere RERANKER_MODEL=rerank-english-v3.0 COHERE_API_KEY=... \
  PYTHONPATH=src:. uvicorn src.internal.servers.web.app:app --host 127.0.0.1 --port 7860

Enable async + Redis cache wrapper:

RERANKER_PROVIDER=local RERANKER_ASYNC=true \
  RERANKER_TIMEOUT_MS=500 RERANKER_CACHE_REDIS_URL=redis://localhost:6379 \
  PYTHONPATH=src:. uvicorn src.internal.servers.web.app:app --host 127.0.0.1 --port 7860

Enable two-stage pipeline (fast pre-filter → heavy scorer):

RERANKER_PROVIDER=local RERANKER_TWO_STAGE=true \
  RERANKER_FAST_MODEL=BAAI/bge-reranker-base \
  RERANKER_PRE_FILTER_TOP_N=50 RERANKER_OVER_FETCH_MULTIPLIER=2.0 \
  PYTHONPATH=src:. uvicorn src.internal.servers.web.app:app --host 127.0.0.1 --port 7860

Enable the ONNX runtime (typically lower latency than PyTorch; requires pip install "optimum[onnxruntime]"):

RERANKER_PROVIDER=local RERANKER_USE_ONNX=true RERANKER_MODEL=BAAI/bge-reranker-base \
  PYTHONPATH=src:. uvicorn src.internal.servers.web.app:app --host 127.0.0.1 --port 7860

Evaluate reranker quality and latency:

# Baseline vs reranked NDCG/MRR + per-query latency
python -m src.internal.retrieval.eval_runner \
  --dataset data/eval/qa_pairs.jsonl --top_k 10 \
  --reranker local --reranker_model BAAI/bge-reranker-v2-m3 \
  --compare-baseline --slo-ms 200

# Output JSON:
# { "retrieval":  {"ndcg@10": 0.48, "mrr": 0.63},
#   "reranked":   {"ndcg@10": 0.55, "mrr": 0.71, "map@10": 0.52},
#   "latency_ms": {"mean": 312, "p50": 290, "p99": 680, "n": 50},
#   "reranker_improvement_ratio": 0.145 }

Benchmark model configurations offline:

python -m src.internal.retrieval.reranker_benchmark \
  --qa-pairs data/eval/qa_pairs.jsonl \
  --models BAAI/bge-reranker-base BAAI/bge-reranker-v2-m3 \
  --batch-sizes 8 16 32 \
  --max-tokens 256 512 \
  --output results/reranker_bench.jsonl
# Prints ranked table sorted by NDCG@10

Retrieval Server API

The retrieval server (src/internal/servers/retrieval/server.py, examples use :8001) exposes the retrieval core over HTTP. The demo server (demo.py, TF-IDF) only serves POST /retrieve; the full server below adds per-mode and admin endpoints.

Health:

curl -s http://localhost:8001/health
# → {"status": "ok", "backend": "local"}

Hybrid search with metadata filters (POST /search — sparse+dense → RRF → MMR → optional rerank):

curl -s -X POST http://localhost:8001/search \
  -H "Content-Type: application/json" \
  -d '{"query": "what is FAISS?", "top_k": 5, "filters": {"source": "arxiv"}}'
# → {"results": [{"doc_id": "...", "title": "...", "text": "...", "score": 0.71, ...}],
#    "retrieval_mode": "hybrid", "executed_queries": ["what is FAISS?"], "latency_ms": 41.2}
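
The MMR stage in the /search pipeline trades relevance against redundancy. Below is a generic maximal-marginal-relevance sketch in pure Python; mmr_select, the 2-d toy vectors, and the exact scoring formula are assumptions for illustration and may not match how RetrievalService applies mmr_lambda.

```python
import math
from typing import Dict, List


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr_select(
    query_vec: List[float],
    doc_vecs: Dict[str, List[float]],
    top_k: int,
    mmr_lambda: float = 0.5,
) -> List[str]:
    """Greedy MMR: score = lambda * sim(query, doc) - (1 - lambda) * max sim(doc, selected)."""
    selected: List[str] = []
    candidates = set(doc_vecs)
    while candidates and len(selected) < top_k:
        def score(doc_id: str) -> float:
            relevance = cosine(query_vec, doc_vecs[doc_id])
            redundancy = max(
                (cosine(doc_vecs[doc_id], doc_vecs[s]) for s in selected), default=0.0
            )
            return mmr_lambda * relevance - (1.0 - mmr_lambda) * redundancy

        best = max(sorted(candidates), key=score)  # sorted() makes ties deterministic
        selected.append(best)
        candidates.remove(best)
    return selected


query = [1.0, 0.0]
docs = {
    "a": [1.0, 0.10],   # most relevant
    "b": [1.0, 0.12],   # near-duplicate of "a"
    "c": [0.8, -0.6],   # less relevant but different
}
print(mmr_select(query, docs, top_k=2, mmr_lambda=0.5))
print(mmr_select(query, docs, top_k=2, mmr_lambda=1.0))
```

With mmr_lambda=0.5 the near-duplicate "b" is skipped in favor of "c"; with mmr_lambda=1.0 the selection reduces to plain relevance ordering.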

Per-mode retrieval (/internal/search/* — isolate one retrieval strategy, e.g. for evals):

# Sparse (BM25) only
curl -s -X POST http://localhost:8001/internal/search/sparse \
  -H "Content-Type: application/json" -d '{"query": "vector database", "top_k": 5}'
# → retrieval_mode: "sparse"

# Dense (embeddings) only
curl -s -X POST http://localhost:8001/internal/search/dense \
  -H "Content-Type: application/json" -d '{"query": "vector database", "top_k": 5}'
# → retrieval_mode: "dense"

# Hybrid with explicit fusion/MMR knobs
curl -s -X POST http://localhost:8001/internal/search/hybrid \
  -H "Content-Type: application/json" \
  -d '{"query": "vector database", "top_k": 5, "over_fetch": 4, "mmr_lambda": 0.5}'
# → retrieval_mode: "hybrid"

# GraphRAG (entity-graph re-ranking)
curl -s -X POST http://localhost:8001/internal/search/graph \
  -H "Content-Type: application/json" -d '{"query": "who founded OpenAI", "top_k": 5}'
# → retrieval_mode: "graph"

Demo server (demo.py, TF-IDF, no Java/embeddings; note that the parameter here is topk, not top_k):

curl -s -X POST http://localhost:8001/retrieve \
  -H "Content-Type: application/json" -d '{"query": "what is FAISS?", "topk": 5}'

Standalone reranker (rerank.py — batch interface: queries + per-query documents lists):

curl -s -X POST http://localhost:8001/rerank \
  -H "Content-Type: application/json" \
  -d '{"queries": ["what is FAISS?"],
       "documents": [[{"title": "FAISS", "content": "FAISS is a similarity search library"},
                      {"title": "Cats", "content": "Cats are mammals"}]],
       "rerank_topk": 2}'

Inspect / hot-reload retrieval config (admin):

curl -s http://localhost:8001/api/admin/retrieval/stats
curl -s -X PATCH http://localhost:8001/api/admin/retrieval/config \
  -H "Content-Type: application/json" \
  -d '{"rrf_k": 80, "mmr_lambda": 0.4, "nprobe": 96, "result_cache_ttl": 600}'

Web Backend API

The FastAPI web backend (src/internal/servers/web/app.py, :7860) drives the UI and agent loops.

Run the intent-routed agent (POST /api/agent) — auto-routes search / chat / tool; response.intent reflects the chosen path:

curl -s -X POST http://localhost:7860/api/agent \
  -H "Content-Type: application/json" \
  -d '{"query": "Compare dense and sparse retrieval", "mode": "chat_loop", "top_k": 5}'
# → {"answer": "...", "intent": "chat", "citations": ["[D1]"], "documents": [...], "session_id": "..."}

response.intent is "search" | "chat" | "tool" and is the single field that drives the intent-adaptive layout (App.tsx maps it to a .results-layout class). Read just that field:

curl -s -X POST http://localhost:7860/api/agent \
  -H "Content-Type: application/json" \
  -d '{"query": "find the onboarding checklist", "top_k": 5}' \
  | python -c "import sys, json; print(json.load(sys.stdin)['intent'])"
# → search

Stream the same over SSE (POST /api/agent/stream) — emits one progress event after each agent turn (via the on_turn callback), then answer, then done (which carries intent, citations, and documents; the frontend feeds intent to setIntent). The non-streaming /api/agent is unchanged:

curl -sN -X POST http://localhost:7860/api/agent/stream \
  -H "Content-Type: application/json" \
  -d '{"query": "Compare dense and sparse retrieval", "top_k": 5}'
# Server-Sent Events (one JSON object per `data:` line):
# data: {"type": "progress", "turn": 1, "text": "search_routing_tool · 5 docs"}
# data: {"type": "progress", "turn": 2, "text": "writing answer…"}
# data: {"type": "answer",   "text": "Dense retrieval embeds the query …"}
# data: {"type": "done",     "session_id": "...", "intent": "chat", "citations": ["[D1]"], "documents": [...]}

On failure the stream yields data: {"type": "error", "detail": "..."} instead of done, which streamAgent surfaces as the error banner.
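
A client can consume this stream with a few lines of Python. The sketch below parses already-received `data:` lines rather than opening a connection, so it runs without a server; iter_sse_events and collect are hypothetical helper names, and the event shapes are taken from the example output above.

```python
import json
from typing import Dict, Iterable, Iterator, List, Optional, Tuple


def iter_sse_events(lines: Iterable[str]) -> Iterator[Dict]:
    """Yield the JSON payload of every `data:` line; skip blanks and other fields."""
    for raw in lines:
        line = raw.strip()
        if line.startswith("data:"):
            yield json.loads(line[len("data:"):].strip())


def collect(lines: Iterable[str]) -> Tuple[List[str], str, Optional[Dict]]:
    """Gather progress messages, the answer text, and the final `done` event."""
    progress: List[str] = []
    answer = ""
    done: Optional[Dict] = None
    for event in iter_sse_events(lines):
        kind = event.get("type")
        if kind == "progress":
            progress.append(event["text"])
        elif kind == "answer":
            answer = event["text"]
        elif kind == "error":
            raise RuntimeError(event.get("detail", "stream error"))
        elif kind == "done":
            done = event
            break
    return progress, answer, done


sample = [
    'data: {"type": "progress", "turn": 1, "text": "search_routing_tool · 5 docs"}',
    "",
    'data: {"type": "answer", "text": "Dense retrieval embeds the query."}',
    "",
    'data: {"type": "done", "session_id": "s1", "intent": "chat", "citations": ["[D1]"], "documents": []}',
]
progress, answer, done = collect(sample)
print(progress, answer, done["intent"])
```

Against a live backend the same functions could be fed from any HTTP client that iterates over response lines.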

Sessions:

curl -s -X POST http://localhost:7860/api/sessions \
  -H "Content-Type: application/json" -d '{"title": "Search session"}'
curl -s http://localhost:7860/api/sessions/{session_id}

Submit retrieval feedback (POST /api/feedback — drives the feedback-GRPO training signal):

curl -s -X POST http://localhost:7860/api/feedback \
  -H "Content-Type: application/json" \
  -d '{"session_id": "sess-123", "signal": "thumbs_up"}'
# → {"ok": true}

Chat & Session API

Chat session management and search-flow routing live on the web backend (:7860) under the /chat, /search, and /query routers (src/internal/servers/query_and_chat/). The streamed send-message flow itself is POST /api/agent / /api/agent/stream above; these endpoints manage the sessions and feedback around it.

Chat sessions (/chat):

# Create a session
curl -s -X POST http://localhost:7860/chat/create-chat-session \
  -H "Content-Type: application/json" -d '{"title": "Onboarding questions"}'
# → {"chat_session_id": "..."}

# List the user's sessions / fetch one with its messages
curl -s http://localhost:7860/chat/get-user-chat-sessions
curl -s http://localhost:7860/chat/get-chat-session/{session_id}

# Rename / delete
curl -s -X PUT http://localhost:7860/chat/rename-chat-session \
  -H "Content-Type: application/json" \
  -d '{"chat_session_id": "...", "name": "Renamed"}'
curl -s -X DELETE http://localhost:7860/chat/delete-chat-session/{session_id}

Per-message feedback (POST /chat/create-chat-message-feedback):

curl -s -X POST http://localhost:7860/chat/create-chat-message-feedback \
  -H "Content-Type: application/json" \
  -d '{"chat_message_id": "...", "is_positive": true, "feedback_text": "spot on"}'

Search-flow classification (POST /search/search-flow-classification — keyword-search vs chat routing):

curl -s -X POST http://localhost:7860/search/search-flow-classification \
  -H "Content-Type: application/json" -d '{"user_query": "find the Q3 onboarding deck"}'
# → {"is_search_flow": true}

Direct search message (POST /search/send-search-message — optional query expansion, streamable):

curl -s -X POST http://localhost:7860/search/send-search-message \
  -H "Content-Type: application/json" \
  -d '{"search_query": "vector database benchmarks", "run_query_expansion": true, "num_hits": 10, "stream": false}'

Search history (GET /search/search-history):

curl -s http://localhost:7860/search/search-history

GET /query/standard-answer exists but is an Enterprise-gated stub — it returns 501 ("Standard Answers is an Enterprise feature … not available in this deployment") in the open-source build.

Retrieval Optimization

All optimization components are opt-in; with their env vars unset, behavior is unchanged from M1–M4.

Tune BM25 parameters against your QA pairs:

curl -s -X POST http://localhost:8001/internal/optimize/bm25-tune \
  -H "Content-Type: application/json" \
  -d '{"qa_pairs_path": "data/eval/qa_pairs.jsonl", "k1_range": [0.6, 0.9, 1.2], "b_range": [0.5, 0.75]}' \
  -H "Authorization: Bearer $TOKEN"
# → {"k1": 0.9, "b": 0.75, "score": 0.86}

Learn fusion weights (sparse vs dense RRF weights):

curl -s -X POST http://localhost:8001/internal/optimize/fusion-weights \
  -H "Content-Type: application/json" \
  -d '{"qa_pairs_path": "data/eval/qa_pairs.jsonl"}' \
  -H "Authorization: Bearer $TOKEN"
# → {"w_sparse": 0.38, "w_dense": 0.62}

Tune HNSW ef_search for a recall target:

curl -s -X POST http://localhost:8001/internal/optimize/hnsw-tune \
  -H "Content-Type: application/json" \
  -d '{"target_recall": 0.82}' \
  -H "Authorization: Bearer $TOKEN"
# → {"ef_search": 96, "measured_recall": 0.831}

Retrieval stats (cache hit rate, latency, throughput):

curl -s http://localhost:7860/api/admin/retrieval/stats \
  -H "Authorization: Bearer $TOKEN"
# → {"result_cache_hit_rate": 0.42, "p99_latency_ms": 112, "throughput_qps": 87, ...}

Hot-reload tunable parameters without restart:

curl -s -X PATCH http://localhost:7860/api/admin/retrieval/config \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $TOKEN" \
  -d '{"rrf_k": 80, "mmr_lambda": 0.4, "nprobe": 96, "result_cache_ttl": 600}'
# → {"applied": ["rrf_k", "mmr_lambda", "nprobe", "result_cache_ttl"]}

Enable query expansion and result caching:

QUERY_EXPANSION_ENABLED=true SPELL_CORRECTION_ENABLED=true EXPANSION_MAX_TERMS=3 \
  BM25_VARIANT=bm25plus \
  RESULT_CACHE_REDIS_URL=redis://localhost:6379 RESULT_CACHE_TTL=300 \
  ADAPTIVE_MMR=true \
  PYTHONPATH=src:. uvicorn src.internal.servers.web.app:app --host 127.0.0.1 --port 7860

Build an IVF-PQ FAISS index (cuts memory from ~30 GB to ≤ 4 GB at 10 M docs):

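import numpy as np

from src.internal.retrieval.index_optimizer import FAISSIndexBuilder

# Hypothetical path: any float32 array of shape (n_docs, dim) works; dim must be divisible by m.
embeddings = np.load("data/embeddings.npy").astype("float32")

builder = FAISSIndexBuilder()
index = builder.build_ivfpq(embeddings, nlist=4096, m=96, nbits=8, nprobe=64)
# Save alongside existing index; load via FAISS_INDEX_TYPE=ivfpq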
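
The memory figures quoted above can be sanity-checked with back-of-the-envelope arithmetic. The calculation assumes 768-dimensional float32 embeddings (the width of intfloat/e5-base-v2) and 8-byte document ids; it ignores allocator and inverted-list overhead, so treat the result as a lower bound rather than a measurement.

```python
# Rough memory estimate for a flat index vs IVF-PQ at 10 M documents.
n_docs = 10_000_000
dim = 768                      # assumption: e5-base-v2 embedding width
m, nbits, nlist = 96, 8, 4096  # same parameters as the build_ivfpq call above

flat_bytes = n_docs * dim * 4                 # one float32 per dimension
code_bytes = n_docs * m * nbits // 8          # PQ code: m sub-quantizers x nbits each
id_bytes = n_docs * 8                         # assumption: one 64-bit id per vector
centroid_bytes = nlist * dim * 4              # coarse (IVF) centroids
codebook_bytes = m * (2 ** nbits) * (dim // m) * 4  # PQ codebooks

ivfpq_bytes = code_bytes + id_bytes + centroid_bytes + codebook_bytes

print(f"flat:   {flat_bytes / 1e9:.2f} GB")
print(f"ivf-pq: {ivfpq_bytes / 1e9:.2f} GB")
```

Under these assumptions the flat index needs about 30.7 GB and the compressed index about 1.05 GB, which is consistent with the "~30 GB to ≤ 4 GB" claim with headroom for runtime overhead.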

Query Transformation Optimization

A layered-wrapper optimization stack over QueryTransformPipeline, structured like the Neural Reranking wrapper chain. Every layer is opt-in; with all QT_* unset, RetrievalService runs the single-query path unchanged (build_query_transform_pipeline_from_env returns None).

Wrapper chain (outermost → innermost):

RoutedQueryTransformPipeline → CachedQueryTransformPipeline → AsyncQueryTransformPipeline → QueryTransformPipeline (leaf)

Enable parallel transforms + Redis bundle cache:

QT_DECOMPOSE=true QT_HYDE=true QT_STEP_BACK=true \
  QT_ASYNC=true QT_TRANSFORM_TIMEOUT_MS=400 \
  QT_CACHE_REDIS_URL=redis://localhost:6379 QT_CACHE_TTL_SECONDS=600 \
  PYTHONPATH=src:. uvicorn src.internal.servers.web.app:app --host 127.0.0.1 --port 7860

Enable Multi-Query + weighted RAG-Fusion:

QT_MULTI_QUERY=true QT_MULTI_QUERY_N=3 QT_FUSION_WEIGHTED=true \
  PYTHONPATH=src:. uvicorn src.internal.servers.web.app:app --host 127.0.0.1 --port 7860

Enable per-query learned routing (heuristic until an artifact exists):

QT_ROUTER=true QT_ROUTER_MODEL_PATH=data/query_router.joblib \
  PYTHONPATH=src:. uvicorn src.internal.servers.web.app:app --host 127.0.0.1 --port 7860

QT_ROUTER and QT_MULTI_QUERY each activate the pipeline on their own — no other QT_* flag is required.

Query transformation is backend-only — there is no dedicated HTTP endpoint and no query-transform-specific UI. The pipeline runs inside RetrievalService.from_env(), so it applies to both the retrieval server's /search and the web backend's /api/agent. Its observable effect is the +rag_fusion suffix on retrieval_mode.

Test it on the retrieval server (POST /search — retrieval_mode reflects the transform):

# Start the retrieval server with QT flags enabled, then:
curl -s -X POST http://localhost:8001/search \
  -H "Content-Type: application/json" \
  -d '{"query": "Compare dense and sparse retrieval", "top_k": 5}' \
  | python -c "import sys, json; print(json.load(sys.stdin)['retrieval_mode'])"
# → hybrid+rag_fusion

Test it on the web backend (POST /api/agent):

curl -s -X POST http://localhost:7860/api/agent \
  -H "Content-Type: application/json" \
  -d '{"query": "Compare dense and sparse retrieval", "mode": "chat_loop", "top_k": 5}' \
  | python -m json.tool | grep -i retrieval_mode
# → "retrieval_mode": "hybrid+rag_fusion"   (or "hybrid+rag_fusion+reranked" with a reranker)

Extract metadata filters from natural language (numeric operators behind QT_CONSTRUCT_OPERATORS):

QT_CONSTRUCT_FILTERS=true QT_CONSTRUCT_OPERATORS=true \
  PYTHONPATH=src:. uvicorn src.internal.servers.web.app:app --host 127.0.0.1 --port 7860
# "arxiv papers after 2023 rated above 4" → filters {date_after: "2023-...", rating_gte: 4}
curl -s -X POST http://localhost:7860/api/agent \
  -H "Content-Type: application/json" \
  -d '{"query": "arxiv papers after 2023 rated above 4 on retrieval", "mode": "chat_loop", "top_k": 5}'

Train the learned router offline:

python -m src.training.train_query_router --out data/query_router.joblib
# → wrote data/query_router.joblib
# Predicts 7 transform labels: decompose, hyde, step_back, keywords, construct_filters, multi_query, rewrite

Gate transform latency in CI:

python -m src.internal.retrieval.eval_runner \
  --dataset data/eval/qa_pairs.jsonl --top_k 10 --qt-slo-ms 300
# Records per-query "qt_latency_ms"; exits non-zero when P99 transform latency > 300ms

Benchmark technique combinations offline (Python API; the --dataset CLI ships a stub retrieve_fn to wire to your retriever):

from src.context.query_transform import QueryTransformConfig
from src.internal.retrieval.query_transform_benchmark import run_query_transform_benchmark

dataset = [("what is FAISS", {"doc-1"}), ("compare BM25 and dense", {"doc-2"})]

def retrieve(query, config):
    # build a pipeline from `config`, run RetrievalService.search, return ranked doc_ids
    ...

rows = run_query_transform_benchmark(dataset, retrieve, [
    QueryTransformConfig(),
    QueryTransformConfig(multi_query=True),
    QueryTransformConfig(decompose=True, hyde=True),
], k=10)
# → [{"config_signature": "...", "recall": 0.91, "ndcg": 0.78, "mean_latency_ms": 142.0}, ...]

Routing & Query Construction

This section covers the RAG routing and query construction stage (src/internal/routing/). It decides where a query should go (domain → source → retriever) and how to express it for the chosen backend. It is distinct from Intent Routing (web-level search/chat/tool) and from QueryRouter (which picks transforms): this layer picks the retriever/construction target per query.

Backend-only and default-off. With no ROUTING_* env vars set, build_router_from_env() returns None and RetrievalService.search skips the routing branch entirely, so behavior is byte-identical to the unrouted path: zero overhead, no frontend change. There is no dedicated HTTP endpoint or UI; routing runs inside RetrievalService.from_env().

Pipeline:

query → Router.route() → RouteDecision(domain, sources, retriever, construction_target)
      → QueryConstructor.construct() → ConstructedQuery(target, payload, text)

Router strategies (heuristic by default; the optional logical and semantic strategies fall back to it on any failure):

Strategy	Env	How it routes
Heuristic	(default)	Rule-based cue matching → SQL / GRAPH / API / default HYBRID. No LLM; the path the accuracy gate runs against
Logical	`ROUTING_LOGICAL=true`	LLM structured-classification into a registered route by name
Semantic	`ROUTING_SEMANTIC=true`	Embedding cosine between the query and each route's description

Routes come from a config-driven registry (ROUTING_REGISTRY_PATH → JSON of {name, description, sources, retriever}; a built-in default mirrors the local corpus). RetrieverTarget ∈ sparse · dense · hybrid · metadata · sql · graph · api.

Six query constructors (construction/, one construct(query, route) -> ConstructedQuery interface):

Constructor	Target	Backing	Output
Metadata Filter	`metadata`	wraps `QueryConstructor`	NL → `{filters}` + cleaned query
Vector Search	`dense`	params	`{top_k, namespace, filters}`
Hybrid Retrieval	`hybrid`	reuses `adaptive_mmr_lambda`	`{rrf_k, w_sparse, w_dense, mmr_lambda}`
SQL Generation	`sql`	net-new (no exec)	schema-aware Text-to-SQL, SELECT-only + table allowlist + multi-statement reject
Knowledge Graph	`graph`	net-new (no exec)	read-only Cypher (`MATCH…RETURN`), word-boundary write-clause rejection
API Request	`api`	net-new (no exec)	`{endpoint, params}` filtered to an `ApiSpec` allowlist

The three net-new constructors build and validate but never execute a query — there is no live SQL/KG/API backend, so RetrievalService short-circuits the sql/graph/api targets to ([], "routed:<target>"). When a real backend is wired later, only the executor changes. Every route()/construct() degrades to a safe empty/None payload rather than raising.
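The build-and-validate step for the three net-new targets can be approximated as below. This is a sketch under assumptions: the function names are hypothetical, the write-clause list is illustrative, and ApiSpec is modeled as a plain dict from endpoint to allowed parameter names. Each validator returns None instead of raising, matching the degrade-to-empty behavior described above.

```python
import re

# Illustrative list of Cypher write clauses, matched on word boundaries only.
_WRITE_CLAUSES = re.compile(
    r"\b(CREATE|MERGE|DELETE|DETACH|SET|REMOVE|DROP)\b", re.IGNORECASE
)


def validate_sql(sql, allowed_tables):
    """Accept a single SELECT statement over allowlisted tables, else None."""
    stmt = sql.strip().rstrip(";").strip()
    if ";" in stmt:  # conservative multi-statement reject
        return None
    if not re.match(r"(?i)select\b", stmt):
        return None
    tables = re.findall(r"(?i)\b(?:from|join)\s+([A-Za-z_][A-Za-z0-9_]*)", stmt)
    if not tables or any(t.lower() not in allowed_tables for t in tables):
        return None
    return stmt


def validate_cypher(cypher):
    """Accept read-only MATCH ... RETURN queries, else None."""
    stmt = cypher.strip()
    if _WRITE_CLAUSES.search(stmt):
        return None
    if not re.match(r"(?i)match\b", stmt) or not re.search(r"(?i)\breturn\b", stmt):
        return None
    return stmt


def filter_api_params(endpoint, params, spec):
    """Keep only parameters the spec allows; unknown endpoints yield None."""
    allowed = spec.get(endpoint)
    if allowed is None:
        return None
    return {"endpoint": endpoint, "params": {k: v for k, v in params.items() if k in allowed}}
```

The word-boundary match is what lets a property such as `n.offset` through while still rejecting a real `SET` clause.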

Enable per-query routing:

ROUTING_ENABLED=true \
  PYTHONPATH=src:. uvicorn src.internal.servers.web.app:app --host 127.0.0.1 --port 7860
# Optional LLM strategies + a custom route registry:
ROUTING_ENABLED=true ROUTING_LOGICAL=true ROUTING_SEMANTIC=true \
  ROUTING_REGISTRY_PATH=data/routes.json uvicorn ...

Score routing accuracy (heuristic router; no LLM needed):

python -m src.internal.retrieval.eval_runner \
  --routing-eval --dataset data/eval/routing_labels.jsonl
# → {"routing_accuracy": 1.0, "num_queries": 12}

Training

The training pipeline is modular: generate trajectories → score with rewards → compute advantages → optimize.

Task	Entry point
QA parquet preparation	`python3 -m examples.prepare_search_qa_dataset`
Training data (shell)	`bin/generate_training_data.sh`
Reward/GRPO smoke test	`python3 -m examples.run_grpo_training_pipeline`
Bamboogle benchmark eval	`python3 -m examples.run_bamboogle_eval` / `bin/run_bamboogle_eval.sh`
Reward function	`src/training/reward.py`
GRPO helpers	`src/training/grpo.py`
Online GRPO for HF LMs	`src/training/ppo/llm_grpo_trainer.py`
Agent-loop GRPO (full reward)	`src/training/ppo/search_agent_grpo_trainer.py`
PPO core	`src/training/ppo/core_algos.py`
Generation and policy loss	`src/model/generation.py`
Feedback-driven GRPO	`python3 -m examples.run_feedback_grpo`
SFT warm-start + GRPO	`python3 -m examples.run_sft_grpo`

Fine-tune from user feedback — train directly on thumbs-up/down sessions collected via POST /api/feedback (no GPU required for the smoke path; --device mps on Apple Silicon):

# Feedback-driven GRPO: load rated sessions from the web DB → reward with human_signal → update
python3 -m examples.run_feedback_grpo \
  --db_path data/feedback.sqlite3 \
  --model Qwen/Qwen2.5-1.5B-Instruct \
  --min_ratings 10 --human_feedback_weight 0.5 \
  --num_rollouts 4 --search_url http://localhost:8001/retrieve --device mps \
  --output_dir data/checkpoints/feedback_grpo/

# SFT warm-start (Phase 1, assistant-token-only CE on thumbs-up traces) then GRPO (Phase 2);
# --sft_epochs 0 skips Phase 1 and runs pure GRPO from the base model
python3 -m examples.run_sft_grpo \
  --db_path data/feedback.sqlite3 --model Qwen/Qwen2.5-1.5B-Instruct \
  --jsonl_path data/sft_pairs.jsonl \
  --sft_epochs 3 --sft_lr 2e-5 --sft_output_dir data/checkpoints/sft_warmstart/ \
  --grpo_output_dir data/checkpoints/sft_grpo/ --device mps

load_feedback_examples raises if fewer than --min_ratings rated sessions exist, so collect feedback first (thumbs in the UI, or POST /api/feedback). There is no HTTP training endpoint — fine-tuning is offline by design; the only backend endpoint in this loop is POST /api/feedback (see Web Backend API).

Reward components (SearchRewardFunction):

Component	Config field	What it measures
Correctness	`correctness_weight`	Judge score against gold answer (EM / contains-match)
Citation support	`citation_support_weight`	Fraction of retrieved docs cited in the final answer
Subquestion coverage	`subquestion_coverage_weight`	Fraction of sub-questions with sufficient evidence
Search quality	`search_quality_weight`	Evaluator verdict + per-query search quality
Unnecessary search	`unnecessary_search_penalty`	Penalty per search round beyond the first
Unnecessary fetch	`unnecessary_fetch_penalty`	Penalty per fetched page not cited in the answer
Fetch usefulness	`fetch_usefulness_reward`	Bonus when fetched pages are cited in the final answer
Format compliance	`format_reward_weight`	Structural compliance in the final answer
Human feedback	`human_feedback_weight`	`human_signal` (±1.0) from thumbs-up/down sessions; `0.0` by default (off)

Reward preset names: sparse_final_only | simple_sparse_with_search_penalty | second_pass | third_pass_with_format (see SearchRewardConfig in src/training/reward.py).
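To make the table concrete, here is one way weighted components and penalties could combine into a single scalar. It is a sketch only: the linear combination, the default values, and the names RewardWeights and shaped_reward are assumptions, and only four of the documented config fields are shown. The actual formula lives in SearchRewardFunction.

```python
from dataclasses import dataclass


@dataclass
class RewardWeights:
    # Field names follow the table above; the default values are illustrative.
    correctness_weight: float = 1.0
    citation_support_weight: float = 0.2
    unnecessary_search_penalty: float = 0.05
    human_feedback_weight: float = 0.0  # off by default, as in the table


def shaped_reward(weights, correctness, citation_support, search_rounds, human_signal=0.0):
    """Weighted sum of component scores minus a per-round search penalty."""
    extra_rounds = max(0, search_rounds - 1)  # only rounds beyond the first are penalized
    return (
        weights.correctness_weight * correctness
        + weights.citation_support_weight * citation_support
        - weights.unnecessary_search_penalty * extra_rounds
        + weights.human_feedback_weight * human_signal
    )
```

With human_feedback_weight left at 0.0, a thumbs-up or thumbs-down signal has no effect on the result, which mirrors the documented default.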

GRPO — score_prompt_group scores G rollouts for one prompt and normalises within-group advantages. compute_grpo_outcome_advantage computes reward_i - mean(group) for a flat rewards list. See src/training/grpo.py.
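The group-relative baseline is simple enough to show directly. The sketch below uses hypothetical function names: the first mirrors the documented reward_i - mean(group) form, and the second adds division by the group standard deviation, which is a common GRPO normalisation but an assumption about what score_prompt_group does.

```python
from statistics import mean, pstdev


def outcome_advantages(rewards):
    """Advantage of each rollout relative to its group mean."""
    baseline = mean(rewards)
    return [r - baseline for r in rewards]


def normalised_advantages(rewards, eps=1e-6):
    """Mean-centred advantages scaled by the group standard deviation."""
    baseline = mean(rewards)
    scale = pstdev(rewards) + eps  # eps avoids division by zero for uniform groups
    return [(r - baseline) / scale for r in rewards]
```

When every rollout in a group earns the same reward, both functions return zeros, so that prompt contributes no gradient signal.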

PPO — compute_ppo_policy_loss_core returns (pg_loss, pg_clipfrac, ppo_kl, surrogate); compute_value_loss returns (vf_loss, vf_clipfrac). Both require an eos_mask tensor. See src/training/ppo/core_algos.py.
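For orientation, the standard PPO clipped surrogate with a token mask looks roughly like the following. This is a dependency-free sketch over flat Python lists, not the tensor code in core_algos.py; the function names and the three-value return are illustrative, and it omits the fourth value the real function returns.

```python
import math


def masked_mean(values, mask):
    """Mean over positions where mask is 1; 0.0 if the mask is empty."""
    total = sum(mask)
    return sum(v * m for v, m in zip(values, mask)) / total if total else 0.0


def clipped_policy_loss(old_logp, logp, advantages, eos_mask, clip=0.2):
    """PPO clipped surrogate over per-token log-probs, ignoring masked tokens."""
    ratios = [math.exp(new - old) for new, old in zip(logp, old_logp)]
    unclipped = [-a * r for a, r in zip(advantages, ratios)]
    clipped = [-a * min(max(r, 1.0 - clip), 1.0 + clip) for a, r in zip(advantages, ratios)]
    # Pessimistic bound: take the larger (worse) of the two losses per token.
    per_token = [max(u, c) for u, c in zip(unclipped, clipped)]
    pg_loss = masked_mean(per_token, eos_mask)
    clipfrac = masked_mean([1.0 if c > u else 0.0 for u, c in zip(unclipped, clipped)], eos_mask)
    approx_kl = masked_mean([old - new for new, old in zip(logp, old_logp)], eos_mask)
    return pg_loss, clipfrac, approx_kl
```

The mask plays the role of eos_mask: positions set to 0 (padding, or tokens after the end of the response) contribute to none of the three statistics.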

Smoke test (end-to-end reward + GRPO, no GPU):

python3 -m examples.run_grpo_training_pipeline

XML search protocol — the ReAct-style trace format used by SearchAgentLoop:

Model-output tags:

<think>decide whether to answer or search</think>
<search>one precise query when external evidence is needed</search>
<fetch>comma- or newline-separated URLs when snippets are insufficient</fetch>
<answer>final grounded answer with citation labels</answer>

Optional model-output tags for multi-hop tasks:

<search_decision>answer</search_decision>   <!-- skip search when internal knowledge suffices -->
<subquestions>one research subquestion per line</subquestions>
<searches>parallel independent queries, one per line</searches>

Environment-only tags (injected by the loop — never output by the model):

<information>search results with citation labels</information>
<search_evaluation>sufficiency verdict and weak-query hints</search_evaluation>
<subquestions_feedback>per-subquestion coverage status</subquestions_feedback>
<full_page>fetched page content</full_page>

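Mask every token inside the environment-only tags out of the policy/SFT action loss: the loop injected those spans, so the model should not be trained to produce them.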
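A character-level sketch of that masking, plus a small tag extractor, is shown below. It assumes well-formed, non-nested tags and works on raw characters for clarity; a real trainer would map these spans onto token positions. Both function names are hypothetical.

```python
import re

# Environment-only tags listed above; their content is injected by the loop.
ENV_TAGS = ("information", "search_evaluation", "subquestions_feedback", "full_page")
_ENV_SPAN = re.compile(r"<(%s)>.*?</\1>" % "|".join(ENV_TAGS), re.DOTALL)


def action_char_mask(trace):
    """Return 1 for model-written characters and 0 for environment spans."""
    mask = [1] * len(trace)
    for match in _ENV_SPAN.finditer(trace):
        for i in range(match.start(), match.end()):
            mask[i] = 0
    return mask


def extract_tag(trace, tag):
    """Collect the stripped contents of every <tag>...</tag> pair."""
    pattern = r"<%s>(.*?)</%s>" % (re.escape(tag), re.escape(tag))
    return [body.strip() for body in re.findall(pattern, trace, re.DOTALL)]
```

Multiplying the per-position loss by this mask leaves gradients only on the think, search, fetch, and answer spans the model wrote itself.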

MCP Server

The MCP server exposes Agentic Search capabilities as Model Context Protocol tools, letting any MCP-compatible client (Claude Desktop, Cursor, etc.) query your knowledge base directly.

Start the server (requires the mcp extra):

pip install -e ".[mcp]"
uvicorn src.internal.mcp_server.api:mcp_app --port 8090

Connect Claude Desktop — add to ~/Library/Application Support/Claude/claude_desktop_config.json:

{
  "mcpServers": {
    "agentic-search": {
      "type": "http",
      "url": "http://localhost:8090/",
      "headers": { "Authorization": "Bearer YOUR_TOKEN_HERE" }
    }
  }
}

Tools available to the LLM client:

Tool	What it does
`search_indexed_documents`	Search the private knowledge base with optional source filter
`search_web`	Web search via Google Custom Search or SerpAPI
`open_urls`	Fetch full page text from a list of URLs
`ask_agentic_search`	Full `SearchAgentLoop` answer with citations
`retrieve_documents`	Raw retrieval — returns full document content and relevance scores
`expand_query`	Query decomposition and HyDE expansion

Dynamic tools registered via FunctionTool / ApiToolRegistry can be mirrored to MCP by calling sync_tool_to_mcp(name) after registration (src/internal/mcp_server/tools/dynamic.py).

Resources:

Resource	What it exposes
`indexed_sources`	Available retrieval source types based on configured API keys
`document_sets`	Document sets scoped for search

Debug with MCP Inspector:

npx @modelcontextprotocol/inspector http://localhost:8090/

MCP environment variables:

Var	Default	Description
`MCP_SERVER_CORS_ORIGINS`	—	Comma-separated allowed origins for CORS
`API_SERVER_HOST`	`127.0.0.1`	Host of the web backend
`API_SERVER_PROTOCOL`	`http`	Protocol for the web backend URL
`API_SERVER_URL_OVERRIDE_FOR_HTTP_REQUESTS`	—	Override the full web backend URL

Evaluation

Bamboogle

Bamboogle is a two-hop QA benchmark that requires chaining retrieval across multiple hops — a strong signal for SearchAgentLoop quality.
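The exact-match and contains-match metrics can be sketched as follows. The normalization (lowercase, strip punctuation and English articles, collapse whitespace) follows the common SQuAD-style convention and is an assumption here; the evaluator's own normalization may differ.

```python
import re
import string


def normalize(text):
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, gold_answers):
    """True when the normalized prediction equals any normalized gold answer."""
    pred = normalize(prediction)
    return any(pred == normalize(gold) for gold in gold_answers)


def contains_match(prediction, gold_answers):
    """True when any non-empty normalized gold answer appears in the prediction."""
    pred = normalize(prediction)
    golds = [normalize(gold) for gold in gold_answers]
    return any(gold and gold in pred for gold in golds)
```

Contains-match is the more lenient of the two: a full-sentence answer that embeds the gold string passes it but fails exact match.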

CLI (local CPU):

python3 -m examples.run_bamboogle_eval \
  --model Qwen/Qwen2.5-1.5B-Instruct --local --limit 5 --print_trace

CLI (server-backed):

python3 -m examples.run_bamboogle_eval \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --vllm_url http://localhost:8080 \
  --search_url http://localhost:8001/retrieve \
  --reward_preset second_pass --limit 125

Reward presets: sparse_final_only | simple_sparse | second_pass | third_pass

Apple Silicon shell script (auto-starts SerpAPI retrieval server, reads SERP_API_KEY from .env):

bin/run_bamboogle_eval.sh                              # 5 examples, mps device
bin/run_bamboogle_eval.sh --smoke                      # 1 example, quick sanity check
bin/run_bamboogle_eval.sh --limit 125                  # full benchmark
bin/run_bamboogle_eval.sh --device cpu --limit 10
bin/run_bamboogle_eval.sh --limit 125 --concurrency 8  # ~6-8x faster via parallel SerpAPI calls
bin/run_bamboogle_eval.sh --limit 125 --concurrency 8 --resume  # resume an interrupted run

The dataset is cached locally after the first download (~/.cache/agentic_search/bamboogle_test.jsonl), so subsequent runs skip the network fetch. --resume reads the existing output file and skips already-evaluated questions, appending new results.
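The resume behavior amounts to reading the existing output, collecting the questions already present, and appending only the rest. The sketch below assumes a JSONL output file with one record per question under a "question" key; both the key name and the helper names are hypothetical.

```python
import json
import os


def load_done(path):
    """Return the set of questions already present in the output file."""
    done = set()
    if not os.path.exists(path):
        return done
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                done.add(json.loads(line)["question"])
    return done


def run_with_resume(questions, evaluate, path):
    """Evaluate unseen questions, append results, return how many were run."""
    done = load_done(path)
    evaluated = 0
    with open(path, "a", encoding="utf-8") as handle:  # append, never truncate
        for question in questions:
            if question in done:
                continue
            record = {"question": question, "result": evaluate(question)}
            handle.write(json.dumps(record) + "\n")
            evaluated += 1
    return evaluated
```

Opening the file in append mode is the key design choice: an interrupted run loses at most the question that was in flight.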

Training data generation:

bin/generate_training_data.sh                         # Bamboogle → data/bamboogle_train/
bin/generate_training_data.sh --preview               # print 5 sample rows, no write
bin/generate_training_data.sh --dataset nq            # Natural Questions
bin/generate_training_data.sh --dataset trivia_qa     # TriviaQA
bin/generate_training_data.sh --dataset hotpotqa --max_examples 500

Each run writes data/<dataset>_train/train.parquet and data/<dataset>_train/test.parquet ready for LLMGRPOTrainer or SFT.

API Health Checks

Web backend: http://localhost:7860 · Retrieval server: http://localhost:8001

Generate a dev JWT (required for admin endpoints):

export TOKEN=$(bin/gen_dev_token.sh)   # or: source bin/gen_dev_token.sh

Core

curl -s http://localhost:7860/health                  # web server
curl -s http://localhost:8001/health                  # retrieval server
curl -s http://localhost:7860/settings                # tier / license status (no auth)

Search & chat

curl -s -X POST http://localhost:7860/api/agent \
  -H "Content-Type: application/json" \
  -d '{"query": "What is FAISS?", "mode": "search_tool"}'

curl -s http://localhost:7860/api/sessions/SESSION_ID -H "Authorization: Bearer $TOKEN"

curl -s -X POST http://localhost:8001/retrieve \
  -H "Content-Type: application/json" -d '{"query": "dense retrieval", "topk": 3}'
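The same retrieval call can be made from Python with only the standard library. This is a sketch: the request body mirrors the curl example above, the helper names are hypothetical, and the response is returned as parsed JSON without assuming its schema.

```python
import json
import urllib.request


def build_retrieve_request(query, topk=3, base_url="http://localhost:8001"):
    """Build the POST /retrieve request without sending it."""
    body = json.dumps({"query": query, "topk": topk}).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + "/retrieve",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def retrieve(query, topk=3, base_url="http://localhost:8001", timeout=10.0):
    """Send the request and return the decoded JSON response."""
    request = build_retrieve_request(query, topk, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `retrieve("dense retrieval")` requires the retrieval server to be running on port 8001; building the request separately keeps the payload easy to inspect or unit test.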

Admin — analytics, billing, reporting

curl -s "http://localhost:7860/analytics/query?start=2024-01-01&end=2025-12-31" \
  -H "Authorization: Bearer $TOKEN"
curl -s http://localhost:7860/admin/billing/billing-information -H "Authorization: Bearer $TOKEN"
curl -s http://localhost:7860/admin/usage-report                -H "Authorization: Bearer $TOKEN"

Admin — hooks, rate limits, web search

curl -s http://localhost:7860/admin/hooks/specs              -H "Authorization: Bearer $TOKEN"
curl -s http://localhost:7860/admin/hooks                    -H "Authorization: Bearer $TOKEN"
curl -s http://localhost:7860/admin/token-rate-limits/users  -H "Authorization: Bearer $TOKEN"
curl -s http://localhost:7860/admin/web-search/search-providers -H "Authorization: Bearer $TOKEN"

Admin — license

curl -s http://localhost:7860/license       -H "Authorization: Bearer $TOKEN"
curl -s http://localhost:7860/license/seats -H "Authorization: Bearer $TOKEN"

SCIM (uses SCIM bearer token, not a JWT)

curl -s http://localhost:7860/scim/v2/ServiceProviderConfig  # no auth
curl -s http://localhost:7860/scim/v2/Users  -H "Authorization: Bearer $SCIM_TOKEN"
curl -s http://localhost:7860/scim/v2/Groups -H "Authorization: Bearer $SCIM_TOKEN"

Configuration

Env var	Default	Description
`AGENTIC_SEARCH_AUTH_SECRET`	`agentic-search-dev-secret`	JWT signing secret
`AGENTIC_SEARCH_SUPER_USERS`	`[]`	JSON list of admin user IDs or emails
`AGENTIC_SEARCH_WEB_DB_PATH`	`:memory:`	SQLite path (`:memory:` for ephemeral)
`AGENTIC_SEARCH_RETRIEVAL_URL`	`http://localhost:8001/retrieve`	Retrieval server URL
`AGENTIC_SEARCH_CLOUD_DATA_PLANE_URL`	—	Cloud data plane for billing proxy
`AGENTIC_SEARCH_LICENSE_ENFORCEMENT_ENABLED`	`false`	Enable license gating
`AGENTIC_SEARCH_DATA_DIR`	`~/.local/share/agentic_search`	License file directory
`WEB_DOMAIN`	`http://localhost:8080`	External URL for OAuth redirects
`GEN_AI_MODEL_PROVIDER`	`openai`	LLM provider (openai, anthropic, ollama, etc.)
`GEN_AI_MODEL_VERSION`	`gpt-4o-mini`	Model name / version
`GEN_AI_API_KEY`	—	Provider API key
`GEN_AI_API_BASE`	—	Override base URL (e.g. `http://localhost:11434/v1`)
`OAUTH_SLACK_CLIENT_ID`	—	Slack OAuth app client ID
`OAUTH_CONFLUENCE_CLOUD_CLIENT_ID`	—	Confluence OAuth app client ID
`OAUTH_GOOGLE_DRIVE_CLIENT_ID`	—	Google Drive OAuth app client ID
`RERANKER_PROVIDER`	—	`local` or `cohere`; omit to disable neural reranking in `RetrievalService`
`RERANKER_MODEL`	`BAAI/bge-reranker-v2-m3`	Cross-encoder model for local reranking
`RERANKER_BATCH_SIZE`	`32`	Batch size for local cross-encoder
`RERANKER_DEVICE`	`cpu`	Device for local reranker (`cpu`, `mps`, `cuda`)
`RERANKER_TOP_K`	same as search `top_k`	Cap returned results after reranking
`COHERE_API_KEY`	—	Cohere API key (required when `RERANKER_PROVIDER=cohere`)
`RERANKER_ASYNC`	`false`	Wrap reranker in `AsyncReranker` (thread-pool offload)
`RERANKER_TIMEOUT_MS`	`500`	Per-query scorer timeout for `AsyncReranker`
`RERANKER_MAX_WORKERS`	`4`	Thread pool size for `AsyncReranker`
`RERANKER_CACHE_REDIS_URL`	—	Enable `CachedReranker`; set to a Redis URL
`RERANKER_CACHE_TTL_SECONDS`	`300`	TTL for cached reranker scores
`RERANKER_MAX_TOKENS`	`512`	`PassageTruncator` token limit before scoring (0 = disabled)
`RERANKER_USE_ONNX`	`false`	Load reranker via ONNX runtime (`ONNXReranker`)
`RERANKER_TWO_STAGE`	`false`	Enable `TwoStageReranker` (fast pre-filter → heavy scorer)
`RERANKER_PRE_FILTER_TOP_N`	`50`	Candidates passed to the heavy scorer in two-stage mode
`RERANKER_FAST_MODEL`	inherits `RERANKER_MODEL`	Fast-stage model name in two-stage mode
`RERANKER_OVER_FETCH_MULTIPLIER`	`2.0`	Retrieval over-fetch ratio when a reranker is active
`QUERY_EXPANSION_ENABLED`	`false`	Enable acronym + WordNet synonym expansion in BM25 leg
`SPELL_CORRECTION_ENABLED`	`false`	Enable `symspellpy` spell correction in BM25 leg
`EXPANSION_MAX_TERMS`	`3`	Max added terms per query to prevent BM25 query bloat
`BM25_VARIANT`	—	Set to `bm25plus` to enable BM25+ lower-bound floor (`δ=1.0`)
`FAISS_INDEX_TYPE`	`hnsw`	`ivfpq` for IVF-PQ quantized index; `hnsw` for original
`EF_SEARCH`	—	HNSW `ef_search` override (higher = more recall, slower)
`ADAPTIVE_MMR`	`false`	Select MMR `λ` by query length (short → 0.8, long → 0.3)
`FUSION_WEIGHTS_PATH`	`data/eval/fusion_weights.json`	Learned per-source RRF weights; falls back to uniform if absent
`RESULT_CACHE_REDIS_URL`	—	Enable `ResultCache`; set to a Redis URL
`RESULT_CACHE_TTL`	`300`	TTL in seconds for cached full search responses
`LATENCY_SLO_MS`	`120`	CI SLO gate: P99 above this exits non-zero in `eval_runner`
`QT_DECOMPOSE`	`false`	Enable query decomposition in `QueryTransformPipeline`
`QT_HYDE`	`false`	Enable HyDE (hypothetical document embedding)
`QT_STEP_BACK`	`false`	Enable step-back query rephrasing
`QT_KEYWORDS`	`false`	Enable keyword expansion for BM25 variants
`QT_CONSTRUCT_FILTERS`	`false`	Enable NL → metadata filter extraction
`QT_REWRITE`	`false`	Enable canonical query rewrite (`QueryEnhancer.rewrite`); 7th router label
`QT_MAX_VARIANTS`	`5`	Max parallel retrieval variants when any `QT_*` is enabled
`QT_ASYNC`	`false`	Run the leaf's transform LLM calls in parallel (`AsyncQueryTransformPipeline`)
`QT_TRANSFORM_TIMEOUT_MS`	`400`	Per-transform timeout in milliseconds; on timeout, that transform's field degrades to its default
`QT_MAX_WORKERS`	`5`	Thread-pool size for `AsyncQueryTransformPipeline`
`QT_CACHE_REDIS_URL`	—	Enable `CachedQueryTransformPipeline`; set to a Redis URL
`QT_CACHE_TTL_SECONDS`	`600`	TTL for cached transform bundles
`QT_MULTI_QUERY`	`false`	Enable `MultiQueryGenerator` (N paraphrased query variants)
`QT_MULTI_QUERY_N`	`3`	Number of paraphrases generated per query
`QT_FUSION_WEIGHTED`	`false`	Use `variant_weighted_rrf_fuse` (original query weighted highest)
`QT_SEMANTIC_DEDUP`	`false`	Drop near-duplicate variants before retrieval (needs a backend `embed()`)
`QT_SEMANTIC_DEDUP_THRESHOLD`	`0.95`	Cosine cutoff for variant dedup
`QT_ROUTER`	`false`	Per-query routing of transforms (`QueryRouter` + heuristic fallback)
`QT_ROUTER_MODEL_PATH`	—	Serialized scikit-learn router artifact; heuristic used when unset/missing
`QT_CONSTRUCT_OPERATORS`	`false`	Extract numeric range/comparison filters (`rating_gte`/`rating_lte`)
`ROUTING_ENABLED`	`false`	Enable the per-query routing layer in `RetrievalService` (domain/source/retriever + query construction); zero overhead when unset
`ROUTING_LOGICAL`	`false`	Add the LLM structured-classification router strategy (falls back to heuristic)
`ROUTING_SEMANTIC`	`false`	Add the embedding-similarity router strategy (falls back to heuristic)
`ROUTING_REGISTRY_PATH`	—	JSON route registry (`{name, description, sources, retriever}`); built-in default used when unset
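As an example of how one of these settings behaves, the sketch below shows weighted reciprocal rank fusion with the uniform fallback described for FUSION_WEIGHTS_PATH. It assumes the weights file is a flat JSON object mapping source name to weight, and it uses k=60 as a conventional RRF constant; the function names and the tie-breaking rule are illustrative.

```python
import json
import os


def load_fusion_weights(path, sources):
    """Load per-source weights, falling back to uniform weights if absent."""
    if path and os.path.exists(path):
        with open(path, encoding="utf-8") as handle:
            learned = json.load(handle)
        return {source: float(learned.get(source, 1.0)) for source in sources}
    return {source: 1.0 for source in sources}


def weighted_rrf(rankings, weights, k=60):
    """Fuse ranked doc-id lists: score(d) = sum of w_source / (k + rank)."""
    scores = {}
    for source, doc_ids in rankings.items():
        for rank, doc_id in enumerate(doc_ids, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weights[source] / (k + rank)
    # Highest score first; ties broken by doc id for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
```

Usage: `weighted_rrf({"bm25": ["a", "b"], "dense": ["b", "a"]}, load_fusion_weights(os.environ.get("FUSION_WEIGHTS_PATH"), ["bm25", "dense"]))`.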

Tests

pytest                           # full suite
pytest tests/unit/ -v            # unit only
pytest tests/unit/servers/ -v    # server-focused
pytest tests/unit/test_reward.py tests/unit/test_grpo.py tests/unit/test_llm_agent_generation.py -v

# Integration (requires live server, default http://localhost:8080)
pytest tests/integration/ -v
API_SERVER_HOST=localhost API_SERVER_PORT=8080 pytest tests/integration/

Test area	What is tested
`server/billing/`	Circuit breaker state, endpoint responses, HTTP mocks
`server/features/hooks/`	SSRF safety, endpoint validation, `HookValidateStatus`
`server/license/`	PEM stripping, `_strip_pem` boundary cases
`server/middleware/`	Path allowlist, license enforcement, tier gating
`server/settings/`	`_load_license_status`, `/settings` endpoint
`server/web/test_tool_trace.py`	`ToolCallView` trace parsing, latency rounding, list/string summarisation, error forwarding
`utils/test_license_utils.py`	RSA signature verification with real key pairs
`utils/test_license_expiry.py`	18 parametrized `ExpiryWarningStage` boundary points
`utils/test_tier.py`	`get_tier` + `tier_at_least` matrix

Frontend tests (web/src/components/__tests__/):

Test file	What is tested
`App.test.tsx`	SSE streaming flow, intent class applied per response, reset on new session
`AnswerPanel.test.tsx`	Markdown rendering, `[D1]` citation link generation, `ReactNode[]` children handling
`SessionTimeline.test.tsx`	Chat bubble layout, system message filtering, stable React keys
`SourceGrid.test.tsx`	Card expand/collapse, copy button 1.5 s feedback, `id` anchor attribute
`ToolCallTracePanel.test.tsx`	Empty→null, completed/failed card classes, latency display, JSON arguments

Notes

Dense retrieval defaults to CPU; set --device cuda on a dedicated retrieval node or --device mps on Apple Silicon.
MPS acceleration is available for local inference (--device mps); add --allow_unsafe_mps to suppress PyTorch MPS safety warnings.
BM25 serving requires Java because Pyserini uses Lucene.
Empty or invalid queries return empty result lists.
Some web pages block scraping or return little usable text.
Google Custom Search and SerpAPI are subject to their own quota and billing rules.
If prepare_search_qa_dataset fails with a pyarrow extension error, run pip install -r requirements.txt.
