| title | Autonomous Agent API |
|---|---|
| emoji | π€ |
| colorFrom | blue |
| colorTo | purple |
| sdk | docker |
| app_port | 7860 |
| pinned | false |
A production-grade, multi-phase AI agent with adaptive planning, critic evaluation, semantic memory, and a fine-tuned tool router.
Most AI "agents" are thin wrappers around a single LLM call. This project builds a real agentic system with:
- A Planner that decomposes queries into executable steps
- An Executor that runs each step through 9 real-world tools
- A Critic that evaluates answers for grounding, completeness, and faithfulness
- Semantic memory with hybrid retrieval (vector + keyword) that learns from every interaction
- A fine-tuned ToolForge router (QLoRA, 86% accuracy) that can replace the heuristic classifier
The result is an agent that can handle everything from "what's 15% of 2850?" to "plan a 5-day trip to Japan" β with full observability, structured traces, and self-improving answer quality.
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β FastAPI Backend β
β β
β ββββββββββββ βββββββββββββββ ββββββββββββββββββββββββββββ β
β β Query βββββΆβ Classifier βββββΆβ Planner (LLM-powered) β β
β β β β (Heuristic / β β Decomposes into steps β β
β β β β ToolForge) β βββββββββββββ¬βββββββββββββββ β
β β β βββββββββββββββ β β
β β β β βΌ β
β β β ββββββ΄ββββ ββββββββββββββββββββββββββββ β
β β β βMemory β β Executor β β
β β β βRetrievalβ β ββ Tool steps β Agent β β
β β β β(Hybrid)β β ββ Reasoning steps β β
β β β ββββββββββ β ββ Early stopping β β
β β β β ββ Dynamic plan adjust β β
β β β βββββββββββββ¬βββββββββββββββ β
β β β β β
β β β βββββββββββββββββββββββββββ€ β
β β β βΌ βΌ β
β β β ββββββββββββ βββββββββββββββββ β
β β β β Critic βββββββββββ Synthesizer β β
β β β β (5-axis β β (Natural β β
β β β β eval) ββββββββββΆβ language) β β
β β β ββββββββββββ refine βββββββββββββββββ β
β β β β
β ββββββββββββ βββββββββββββββββββββββββββββββββββββββββββββββ β
β β 9 Tools: web_search, calculator, weather, β β
β β wikipedia, datetime, dictionary, translate, β β
β β unit_converter, web_reader β β
β βββββββββββββββββββββββββββββββββββββββββββββββ β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β² β
β HTTP / JSON API β
βΌ βΌ
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β React Frontend (Vite) β
β ββ ChatUI with typing indicator + animated responses β
β ββ AgentPanel (live execution trace, tools used, confidence) β
β ββ Client-side memory (localStorage profile + sessionStorage) β
β ββ FloatingBackground (glassmorphism, animated particles) β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Every query is classified before any LLM call, using zero-cost regex heuristics:
| Decision | Example | What Happens |
|---|---|---|
direct_answer |
"Hi", "Explain recursion" | Skips planning + tools entirely β 1 LLM call |
needs_search |
"Latest AI news", "Weather in Tokyo" | Full pipeline: plan β tool β synthesize β critique |
memory_sufficient |
"What's my name?" (after telling it) | Answers from stored memory, no tools |
autonomous_task |
"Plan a trip to Japan" | Delegates to AutonomousExecutor (multi-aspect) |
ToolForge upgrade: The heuristic classifier can be replaced by a fine-tuned Qwen2.5-7B model (86.2% accuracy, trained on 1.1K synthetic tool-call traces). Toggle with TOOLFORGE_ENABLED=true.
The Planner decomposes queries into typed steps via LLM:
{
"plan": [
{"step": "Search for latest AI breakthroughs and announcements", "type": "tool"},
{"step": "Extract key developments from search results", "type": "reasoning"},
{"step": "Organize findings into a clear summary", "type": "reasoning"}
]
}- Fast-path:
direct_answerandmemory_sufficientskip the LLM planning call entirely - Adaptive: Plan length adjusts based on query complexity (1β5 steps)
- Validated: Each step must have a specific subject + action verb (vague steps like "explore" are rejected)
The Executor runs each plan step through the Agent with:
- Early stopping: If an intermediate result is strong enough (200+ chars, 3+ entities, 2+ numbers), remaining reasoning steps are skipped
- Dynamic plan adjustment: Trims redundant reasoning steps after strong tool results, or appends fallback reasoning after weak ones
- Weak-result recovery: If a tool returns weak results, retries with a different query, then falls back to reasoning
- Request-scoped tool cache: Prevents duplicate tool calls within a single request
- Context compression: Previous step results are truncated (500β1500 chars) before injection
The core Agent runs a bounded loop (max 4 iterations, max 2 tool calls):
Step 1: LLM β tool_call("web_search", {"query": "..."})
Step 2: Tool result β LLM β final_answer
Safety controls:
- Final step enforcement: Last iteration forces
final_answer, blocks tool calls - Tool call limit: Hard cap on tool invocations per request
- Parse retry: If LLM outputs invalid JSON, feeds error back for one retry
- Invalid tool protection: Unknown tools return structured errors, not crashes
- Reasoning visibility: Every LLM decision includes a
reasoningfield (logged, never in API)
Every synthesized answer is evaluated by the Critic on 5 dimensions:
| Axis | What It Checks |
|---|---|
| Grounding | Are all claims supported by the step results? |
| Completeness | Does the answer address every part of the query? |
| Specificity | Does it include concrete details (names, numbers, dates)? |
| Redundancy | Is it concise without unnecessary repetition? |
| Faithfulness | Are there any hallucinated facts not in the results? |
Each issue is severity-tagged (high / medium / low):
highβ answer is refined (up to 2 iterations)lowβ accepted as-is- Regression guard: refined answer must be β₯70% the length of the previous (prevents oversimplification)
The memory system gives the agent persistent context across conversations:
Storage pipeline:
- Every query runs through the MemoryAnalyzer (LLM-based fact extraction)
- Extracted facts pass through 4 quality gates:
- Confidence β₯ 0.8
- No speculative language ("maybe", "might", "someday")
- Valid intent type (profile: identity/preference/event, session: reference/task)
- Must have key + value
- Stored with critic confidence level (low β rejected, medium/high β stored)
- Deduplication: 80% token overlap β skip
Retrieval pipeline (two-stage hybrid):
- Stage 1 β Vector search:
all-MiniLM-L6-v2embeddings β cosine similarity β top 5 candidates - Stage 2 β Keyword re-rank: Weighted Jaccard (0.8 facts + 0.2 summary) + synonym expansion + substring boost + time decay (48h half-life) + confidence multiplier
- Blended score: 0.6 Γ keyword + 0.4 Γ vector
- Semantic dedup: 80% overlap between results β keep higher-scoring one
Complex tasks like "plan a trip to Japan" or "compare React vs Vue" are handled by a separate execution engine:
- Aspect decomposition: LLM breaks the goal into prioritized aspects (e.g., flights, hotels, places, budget, itinerary)
- Adaptive execution: Each aspect is researched with tool calls, with retry + alternative query generation on failure
- Coverage gate: All required aspects must have data before synthesis β missing aspects trigger retry
- Reasoning synthesis: Budget breakdowns, comparisons, itineraries are synthesized from collected data
- Structured output: Final answer is organized by section headers, not flat paragraphs
Limits: max 12 steps, max 6 tool calls, max 2 retries per aspect, max 7 aspects.
| Tool | Source | What It Does |
|---|---|---|
web_search |
DuckDuckGo | Search the internet for current information |
web_reader |
httpx + BeautifulSoup | Extract content from specific URLs |
weather |
OpenWeatherMap API | Current weather for any city |
calculator |
Python eval (sandboxed) |
Math expressions, sqrt, log, trig, etc. |
wikipedia |
Wikipedia API | Encyclopedic summaries |
dictionary |
Free Dictionary API | Definitions, phonetics, examples |
translate |
MyMemory API | Translation between 30+ languages |
unit_converter |
Built-in conversion tables | Length, weight, temperature, volume, etc. |
datetime |
Python datetime |
Current time, timezone conversion, date math |
All tools are registered via ToolRegistry with schema validation (required inputs, type checking) before execution.
The Groq client includes production-grade resilience:
- Multi-key rotation: 5 API keys with most-rested-first selection (not round-robin)
- Proactive throttling: 1s minimum gap per key (prevents 429s before they happen)
- Automatic retry: Parses Groq's
Retry-Afterheader, falls back to exponential backoff - Up to 5 retries with key rotation on rate limit
The heuristic classifier (classify_query()) uses 200+ lines of regex patterns. ToolForge replaces it with a fine-tuned model:
| Metric | Heuristic | ToolForge Model |
|---|---|---|
| Accuracy | ~75% | 86.2% |
| Approach | Regex patterns | QLoRA fine-tuned Qwen2.5-7B |
| Training data | β | 1,173 synthetic examples (Gemini distillation) |
| Ablation runs | β | 4 (tracked on W&B) |
| Latency | 0ms (regex) | ~200ms (GPU inference) |
The integration is a feature flag β set TOOLFORGE_ENABLED=true + provide the adapter path. Falls back gracefully if GPU/dependencies unavailable.
| Layer | Technology |
|---|---|
| Backend | Python 3.12, FastAPI, Uvicorn |
| LLM | Groq API (Llama 3.1 8B Instant) |
| Frontend | React 19, Vite, vanilla CSS |
| Embeddings | sentence-transformers (all-MiniLM-L6-v2) |
| Tool Router | QLoRA fine-tuned Qwen2.5-7B (optional) |
| Deployment | HuggingFace Spaces (Docker) + Vercel |
| Monitoring | Structured logging, full execution traces |
autonomous-agent/
βββ app/
β βββ main.py # FastAPI entrypoint
β βββ config.py # Settings from environment
β βββ agent/
β β βββ agent.py # Core agent loop (4-step, 2-tool limit)
β β βββ executor.py # Planner-Executor pipeline
β β βββ planner.py # LLM-powered query decomposition
β β βββ critic.py # 5-axis answer evaluation
β β βββ autonomous_executor.py # Goal-driven multi-step execution
β β βββ memory_analyzer.py # LLM-based fact extraction (4 quality gates)
β β βββ parser.py # Robust JSON parser for LLM outputs
β β βββ toolforge_router.py # Fine-tuned model router (feature flag)
β βββ llm/
β β βββ groq_client.py # Multi-key rotation + throttling
β βββ memory/
β β βββ memory_store.py # Hybrid RAG (vector + keyword)
β β βββ embedding.py # Sentence-transformer embeddings
β β βββ vector_store.py # In-memory HNSW vector index
β βββ tools/ # 9 tool implementations
β β βββ registry.py # Schema validation + execution
β β βββ search_tool.py # DuckDuckGo web search
β β βββ web_reader.py # URL content extraction
β β βββ weather_tool.py # OpenWeatherMap integration
β β βββ calculator_tool.py # Sandboxed math evaluation
β β βββ wikipedia_tool.py # Wikipedia summaries
β β βββ dictionary_tool.py # Word definitions
β β βββ translation_tool.py # Multi-language translation
β β βββ unit_converter_tool.py # Unit conversion tables
β β βββ datetime_tool.py # Timezone-aware date/time
β βββ services/
β β βββ agent_service.py # Wires everything together
β βββ schemas/
β βββ request.py # Pydantic request/response models
βββ frontend/ # React + Vite
β βββ src/
β βββ components/
β β βββ ChatUI.jsx # Chat interface
β β βββ AgentPanel.jsx # Live execution trace
β β βββ FloatingBackground.jsx # Animated particles
β βββ services/ # API client
βββ tests/
β βββ test_agent_features.py # Unit tests (agent, tools, memory)
β βββ test_prompts.py # Prompt-level regression tests
βββ Dockerfile # Production container
βββ Procfile # HuggingFace Spaces entrypoint
βββ requirements.txt
# Clone
git clone https://github.com/ayushh0110/autonomous-agent.git
cd autonomous-agent
# Install
pip install -r requirements.txt
# Configure
cp app/.env.example app/.env
# Add your GROQ_API_KEY(s)
# Run
uvicorn app.main:app --reload --port 7860cd frontend
npm install
npm run dev# In app/.env
TOOLFORGE_ENABLED=true
TOOLFORGE_ADAPTER_PATH=./checkpoints/qwen7b-r64-lr2e4/final{
"query": "What's the weather in Tokyo?",
"profile_context": [{"key": "name", "value": "Ayush"}],
"session_context": [{"intent": "task", "value": "trip planning"}]
}Response:
{
"response": "Hey! So Tokyo right now is around 18Β°C with clear skies...",
"source": "planner_executor",
"tools_used": ["weather"],
"steps_taken": 2,
"plan": ["Get weather data for Tokyo", "Format into conversational response"],
"confidence": "high",
"refinements": 0,
"memory_used": false,
"decision": "needs_search",
"llm_calls": 4,
"early_stopped": false,
"cache_hits": 0,
"memory_extraction": null
}| Phase | Focus | Key Addition |
|---|---|---|
| 1 | Foundation | FastAPI + Groq + basic agent loop |
| 2 | Tool System | 9 tools + registry + schema validation |
| 2.1 | Safety | Tool limits, final step enforcement, parse retry |
| 3 | Planning | Planner-Executor pipeline + synthesis |
| 3.2 | Quality | Critic (5-axis eval) + severity-aware refinement |
| 4 | Intelligence | Query classifier, early stopping, dynamic planning, tool cache |
| 5 | Memory | Hybrid RAG, memory analyzer, 4 quality gates |
| 6 | Autonomy | AutonomousExecutor for multi-step tasks |
| 7 | Fine-Tuning | ToolForge integration (QLoRA Qwen2.5-7B router) |
MIT