A multi-agent AI system built with LangGraph that answers natural-language questions about demand forecasting, bridging the gap between ML model outputs and business understanding.
Data science teams spend a disproportionate amount of time explaining model outputs to stakeholders, answering the same questions about methodology, performance, and forecast rationale. This assistant eliminates that bottleneck. Business users ask questions in plain language and get answers with supporting data and charts in seconds, not days. The result: faster decisions, higher forecast adoption, and data scientists freed to build rather than explain.
Goes beyond dashboards. A Tableau dashboard can show what the forecast is. It cannot explain why the model chose that number, how it compares to alternatives, or answer a follow-up question the dashboard designer didn't anticipate. Dashboards are static. They answer the questions you thought to ask at design time. This assistant answers the questions that come up in the moment: the ad-hoc, unstructured, context-dependent questions that actually block decisions. No filters to click, no tabs to navigate, no training required. Just ask. And when a new model is added to the system, the assistant handles it immediately without anyone rebuilding dashboard views.
Promotes adoption of data science products. A risk to any ML project is that nobody uses it. Stakeholders override forecasts they don't understand. This assistant gives them a way to interrogate the system: Why was LightGBM chosen over Prophet? How did it perform on Category X? Show me the backtest results. When people can ask questions and get clear answers, they trust the output. When they trust the output, they use it, and the ROI of the entire forecasting investment multiplies.
A supervisor agent interprets each question and routes it to the right specialist:
- RAG Agent: answers methodology and concept questions (e.g., "How does walk-forward backtesting work?") using hybrid retrieval (BM25 + semantic search) over 6 knowledge-base documents
- SQL Agent: answers data and comparison questions (e.g., "Which model has the lowest RMSE?") by writing SQL queries, running calculations, and generating charts
Three layers of guardrails validate input, SQL execution, and output. These check for PII, prompt injection, off-topic queries, SQL injection, hallucination, and low-confidence responses.
Improved from 65% to 100% pass rate over 6 iterations.
Early evaluation identified 5 RAG failure types: vocabulary mismatch, wrong document retrieved, context contamination, incomplete retrieval, and missing methodology. These were resolved through hybrid retrieval (BM25 + semantic), structure-aware chunking (MarkdownNodeParser), query rewriting with routing prefixes, and stronger prompts with database schema descriptions and SQL query examples.
29 test questions across 6 categories, scored 1 to 5 by an LLM-as-judge framework:
| Category | Questions | Avg Score | Pass Rate |
|---|---|---|---|
| Explicit Retrieval | 7 | 5.00 | 100% |
| Reasoning | 5 | 4.60 | 100% |
| Data / SQL | 7 | 4.86 | 100% |
| Visualization | 2 | 5.00 | 100% |
| Out-of-Scope | 4 | 4.50 | 100% |
| Adversarial | 4 | 4.75 | 100% |
| Overall | 29 | 4.72 | 100% |
Routing accuracy: 96% (25/26 routable questions correctly assigned).
- Real forecasting domain: Built on M5 competition data (Walmart demand forecasting with 9 ML/DL models), not generic documents or toy examples.
- Multi-agent with 4 specialized tools: Goes beyond simple RAG:
rag_toolfor knowledge retrieval,sql_toolfor querying forecast results,computation_toolfor arithmetic on metrics,viz_toolfor generating bar and line charts. - 3-layer production guardrails: Input validation (PII, prompt injection, off-topic), tool-level safety (SQL injection prevention), and output checks (hallucination, confidence, attribution).
- LLM-as-judge evaluation: 29 test questions across 6 categories (explicit retrieval, reasoning, data/SQL, visualization, out-of-scope, adversarial) with 100% pass rate.
- Speed-optimized inference: Conditional LLM calls reduce round-trips from 5-6 to 2-3 per question: regex replaces LLM for injection detection, hallucination check is skipped when agent attribution exists, and rule-based query rewriting handles clear questions without an LLM call.
| Decision | Why |
|---|---|
| LangGraph over basic LangChain agents | Explicit state graph gives control over routing, memory, and tool execution. The supervisor pattern cleanly separates concerns between specialist agents. |
| Hybrid retrieval (BM25 + semantic) | Semantic search alone misses exact keyword matches (e.g., model names in results tables). BM25 catches these. Combined retrieval fixed 3 of 5 identified RAG failure types. |
| MarkdownNodeParser for chunking | LlamaIndex's markdown-aware parser chunks by document structure (headers/sections), eliminating context contamination and incomplete retrieval caused by naive fixed-size chunking. |
| Direct agent response extraction | The supervisor tends to paraphrase or swallow agent answers. Extracting responses directly from the message history preserves the agent's exact output. |
| Conditional LLM calls | Not every query needs every check. Regex handles injection detection; hallucination check is skipped when agent attribution exists; query rewriting uses rules for clear questions. This halves response time. |
| 3 separate LLM models | Supervisor, RAG agent, and SQL agent each use a different model optimized for their role (routing, retrieval synthesis, code generation). |
| Externalized prompts and config | All 8 LLM prompts live in prompts.py and all settings in config.yaml, so tuning doesn't require touching application logic. |
Semantic search alone is insufficient for RAG: Embedding similarity missed keyword matches in structured content (e.g., results tables listing model names), causing vocabulary mismatch, wrong-document retrieval, and missing methodology. Adding BM25 as a complement fixed this: semantic search finds conceptually similar content while BM25 catches exact term matches. Combined, they cover each other's blind spots.
Naive chunking causes data quality issues: Fixed-character-count chunking merged unrelated sections into one chunk (context contamination) and split related content across chunks (incomplete retrieval). Switching to MarkdownNodeParser, which chunks by document headers and sections, eliminated both problems.
Supervisors paraphrase agent responses: The LangGraph supervisor rewrote or truncated specialist answers before returning them. The fix was architectural, not prompt-based: walking the message history in reverse and extracting the actual agent response directly.
Ambiguous questions cause routing errors: Routing accuracy was 75% before adding query rewriting. Prefixing rewritten queries with [KNOWLEDGE BASE QUESTION] or [DATA QUESTION] improved accuracy to 96%.
LLM calls are the main latency bottleneck: Each LLM guardrail check adds a full API round-trip. Auditing every call and replacing with regex or rules where possible cut total calls from 5-6 to 2-3 per question, roughly halving response time.
Extract chart paths from ToolMessages, not LLM text: LLMs inconsistently echo file paths in their responses. Scanning ToolMessage objects in the message history is reliable; relying on the agent's text output is not.
| Component | Technology |
|---|---|
| LLM | OpenAI GPT models (3 separate models for supervisor, RAG, SQL agents) |
| Orchestration | LangGraph + langgraph-supervisor + MemorySaver checkpoint |
| RAG | LlamaIndex MarkdownNodeParser + LangChain Chroma + BM25 |
| Agents | LangChain tools + LangChain agents |
| Database | SQLite (forecasts, metrics, hyperparameters) |
| Visualization | matplotlib |
| UI | Streamlit |
| Configuration | YAML + python-dotenv |
multi-agent-forecast-assistant/
├── config.yaml # All settings: LLM models, guardrails, paths, patterns
├── requirements.txt
├── .env # API keys (not committed)
├── .gitignore
├── README.md
├── src/
│ ├── app.py # Streamlit chat interface
│ ├── agents.py # AgentFactory: supervisor + RAG/SQL agents
│ ├── tools.py # AgentTools: rag, sql, computation, viz tools
│ ├── guardrails.py # Input, tool, and output guardrail classes
│ ├── prompts.py # All 8 LLM prompts centralized
│ ├── utils.py # Query rewriting + response extraction
│ ├── config.py # Loads config.yaml, initializes LLMs and logger
│ ├── build_database.py # ETL: CSVs + JSON → SQLite
│ ├── evaluation.py # LLM-as-judge evaluation framework
│ └── test_questions.py # 29 test questions across 6 categories
├── knowledge_base/ # RAG source documents (6 Markdown files)
├── data/ # Source CSVs + generated SQLite database
├── eval_results/ # Evaluation reports and scores
└── plots/ # Generated charts + demo GIF
- Python 3.11+
- OpenAI API key
git clone https://github.com/yedanzhang-ai/multi-agent-forecast-assistant.git
cd multi-agent-forecast-assistant
python -m venv venv
source venv/Scripts/activate # Windows
# source venv/bin/activate # macOS/Linux
pip install -r requirements.txtCreate a .env file in the project root:
OPENAI_API_KEY=your-api-key-here
python src/build_database.pystreamlit run src/app.pypython src/evaluation.py- Expand test coverage. Scale from 29 to 60+ questions with multi-agent edge cases (ambiguous routing, combined RAG+SQL queries, simultaneous guardrail triggers).
- Streaming responses. Stream answers token-by-token instead of waiting for full completion.
- Multi-threaded evaluation. Run test questions in parallel for faster feedback cycles.
- More chart types. Add scatter plots, heatmaps, and multi-panel charts.
- User authentication. Add login and rate limiting.
- Probabilistic forecasts. Extend to prediction intervals, not just point forecasts.
- SHAP explanations. Add a model interpretability tool for feature importance visualization.

