Multi-Agent Forecast Assistant

A multi-agent AI system built with LangGraph that answers natural-language questions about demand forecasting, bridging the gap between ML model outputs and business understanding.

Demo

Why This Exists

Data science teams spend a disproportionate amount of time explaining model outputs to stakeholders, answering the same questions about methodology, performance, and forecast rationale. This assistant eliminates that bottleneck. Business users ask questions in plain language and get answers with supporting data and charts in seconds, not days. The result: faster decisions, higher forecast adoption, and data scientists freed to build rather than explain.

Goes beyond dashboards. A Tableau dashboard can show what the forecast is. It cannot explain why the model chose that number, how it compares to alternatives, or answer a follow-up question the dashboard designer didn't anticipate. Dashboards are static. They answer the questions you thought to ask at design time. This assistant answers the questions that come up in the moment: the ad-hoc, unstructured, context-dependent questions that actually block decisions. No filters to click, no tabs to navigate, no training required. Just ask. And when a new model is added to the system, the assistant handles it immediately without anyone rebuilding dashboard views.

Promotes adoption of data science products. A risk to any ML project is that nobody uses it. Stakeholders override forecasts they don't understand. This assistant gives them a way to interrogate the system: Why was LightGBM chosen over Prophet? How did it perform on Category X? Show me the backtest results. When people can ask questions and get clear answers, they trust the output. When they trust the output, they use it, and the ROI of the entire forecasting investment multiplies.

How It Works

A supervisor agent interprets each question and routes it to the right specialist:

RAG Agent: answers methodology and concept questions (e.g., "How does walk-forward backtesting work?") using hybrid retrieval (BM25 + semantic search) over 6 knowledge-base documents
SQL Agent: answers data and comparison questions (e.g., "Which model has the lowest RMSE?") by writing SQL queries, running calculations, and generating charts

Three layers of guardrails validate input, SQL execution, and output. These check for PII, prompt injection, off-topic queries, SQL injection, hallucination, and low-confidence responses.

Architecture

Evaluation Results

Improved from 65% to 100% pass rate over 6 iterations.

Early evaluation identified 5 RAG failure types: vocabulary mismatch, wrong document retrieved, context contamination, incomplete retrieval, and missing methodology. These were resolved through hybrid retrieval (BM25 + semantic), structure-aware chunking (MarkdownNodeParser), query rewriting with routing prefixes, and stronger prompts with database schema descriptions and SQL query examples.

29 test questions across 6 categories, scored 1 to 5 by an LLM-as-judge framework:

Category	Questions	Avg Score	Pass Rate
Explicit Retrieval	7	5.00	100%
Reasoning	5	4.60	100%
Data / SQL	7	4.86	100%
Visualization	2	5.00	100%
Out-of-Scope	4	4.50	100%
Adversarial	4	4.75	100%
Overall	29	4.72	100%

Routing accuracy: 96% (25/26 routable questions correctly assigned).

What Makes This Project Different

Real forecasting domain: Built on M5 competition data (Walmart demand forecasting with 9 ML/DL models), not generic documents or toy examples.
Multi-agent with 4 specialized tools: Goes beyond simple RAG: rag_tool for knowledge retrieval, sql_tool for querying forecast results, computation_tool for arithmetic on metrics, viz_tool for generating bar and line charts.
3-layer production guardrails: Input validation (PII, prompt injection, off-topic), tool-level safety (SQL injection prevention), and output checks (hallucination, confidence, attribution).
LLM-as-judge evaluation: 29 test questions across 6 categories (explicit retrieval, reasoning, data/SQL, visualization, out-of-scope, adversarial) with 100% pass rate.
Speed-optimized inference: Conditional LLM calls reduce round-trips from 5-6 to 2-3 per question: regex replaces LLM for injection detection, hallucination check is skipped when agent attribution exists, and rule-based query rewriting handles clear questions without an LLM call.

Design Decisions

Decision	Why
LangGraph over basic LangChain agents	Explicit state graph gives control over routing, memory, and tool execution. The supervisor pattern cleanly separates concerns between specialist agents.
Hybrid retrieval (BM25 + semantic)	Semantic search alone misses exact keyword matches (e.g., model names in results tables). BM25 catches these. Combined retrieval fixed 3 of 5 identified RAG failure types.
MarkdownNodeParser for chunking	LlamaIndex's markdown-aware parser chunks by document structure (headers/sections), eliminating context contamination and incomplete retrieval caused by naive fixed-size chunking.
Direct agent response extraction	The supervisor tends to paraphrase or swallow agent answers. Extracting responses directly from the message history preserves the agent's exact output.
Conditional LLM calls	Not every query needs every check. Regex handles injection detection; hallucination check is skipped when agent attribution exists; query rewriting uses rules for clear questions. This halves response time.
3 separate LLM models	Supervisor, RAG agent, and SQL agent each use a different model optimized for their role (routing, retrieval synthesis, code generation).
Externalized prompts and config	All 8 LLM prompts live in `prompts.py` and all settings in `config.yaml`, so tuning doesn't require touching application logic.

Lessons Learned

Semantic search alone is insufficient for RAG: Embedding similarity missed keyword matches in structured content (e.g., results tables listing model names), causing vocabulary mismatch, wrong-document retrieval, and missing methodology. Adding BM25 as a complement fixed this: semantic search finds conceptually similar content while BM25 catches exact term matches. Combined, they cover each other's blind spots.

Naive chunking causes data quality issues: Fixed-character-count chunking merged unrelated sections into one chunk (context contamination) and split related content across chunks (incomplete retrieval). Switching to MarkdownNodeParser, which chunks by document headers and sections, eliminated both problems.

Supervisors paraphrase agent responses: The LangGraph supervisor rewrote or truncated specialist answers before returning them. The fix was architectural, not prompt-based: walking the message history in reverse and extracting the actual agent response directly.

Ambiguous questions cause routing errors: Routing accuracy was 75% before adding query rewriting. Prefixing rewritten queries with [KNOWLEDGE BASE QUESTION] or [DATA QUESTION] improved accuracy to 96%.

LLM calls are the main latency bottleneck: Each LLM guardrail check adds a full API round-trip. Auditing every call and replacing with regex or rules where possible cut total calls from 5-6 to 2-3 per question, roughly halving response time.

Extract chart paths from ToolMessages, not LLM text: LLMs inconsistently echo file paths in their responses. Scanning ToolMessage objects in the message history is reliable; relying on the agent's text output is not.

Tech Stack

Component	Technology
LLM	OpenAI GPT models (3 separate models for supervisor, RAG, SQL agents)
Orchestration	LangGraph + langgraph-supervisor + MemorySaver checkpoint
RAG	LlamaIndex MarkdownNodeParser + LangChain Chroma + BM25
Agents	LangChain tools + LangChain agents
Database	SQLite (forecasts, metrics, hyperparameters)
Visualization	matplotlib
UI	Streamlit
Configuration	YAML + python-dotenv

Project Structure

multi-agent-forecast-assistant/
├── config.yaml              # All settings: LLM models, guardrails, paths, patterns
├── requirements.txt
├── .env                     # API keys (not committed)
├── .gitignore
├── README.md
├── src/
│   ├── app.py               # Streamlit chat interface
│   ├── agents.py            # AgentFactory: supervisor + RAG/SQL agents
│   ├── tools.py             # AgentTools: rag, sql, computation, viz tools
│   ├── guardrails.py        # Input, tool, and output guardrail classes
│   ├── prompts.py           # All 8 LLM prompts centralized
│   ├── utils.py             # Query rewriting + response extraction
│   ├── config.py            # Loads config.yaml, initializes LLMs and logger
│   ├── build_database.py    # ETL: CSVs + JSON → SQLite
│   ├── evaluation.py        # LLM-as-judge evaluation framework
│   └── test_questions.py    # 29 test questions across 6 categories
├── knowledge_base/          # RAG source documents (6 Markdown files)
├── data/                    # Source CSVs + generated SQLite database
├── eval_results/            # Evaluation reports and scores
└── plots/                   # Generated charts + demo GIF

Setup and Run

Prerequisites

Python 3.11+
OpenAI API key

Installation

git clone https://github.com/yedanzhang-ai/multi-agent-forecast-assistant.git
cd multi-agent-forecast-assistant

python -m venv venv
source venv/Scripts/activate   # Windows
# source venv/bin/activate     # macOS/Linux

pip install -r requirements.txt

Configuration

Create a .env file in the project root:

OPENAI_API_KEY=your-api-key-here

Build Database

python src/build_database.py

Run the App

streamlit run src/app.py

Run Evaluation

python src/evaluation.py

Future Work

Expand test coverage. Scale from 29 to 60+ questions with multi-agent edge cases (ambiguous routing, combined RAG+SQL queries, simultaneous guardrail triggers).
Streaming responses. Stream answers token-by-token instead of waiting for full completion.
Multi-threaded evaluation. Run test questions in parallel for faster feedback cycles.
More chart types. Add scatter plots, heatmaps, and multi-panel charts.
User authentication. Add login and rate limiting.
Probabilistic forecasts. Extend to prediction intervals, not just point forecasts.
SHAP explanations. Add a model interpretability tool for feature importance visualization.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Multi-Agent Forecast Assistant

Demo

Why This Exists

How It Works

Architecture

Evaluation Results

What Makes This Project Different

Design Decisions

Lessons Learned

Tech Stack

Project Structure

Setup and Run

Prerequisites

Installation

Configuration

Build Database

Run the App

Run Evaluation

Future Work

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
data		data
eval_results		eval_results
knowledge_base		knowledge_base
plots		plots
src		src
.gitignore		.gitignore
README.md		README.md
config.yaml		config.yaml
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

Multi-Agent Forecast Assistant

Demo

Why This Exists

How It Works

Architecture

Evaluation Results

What Makes This Project Different

Design Decisions

Lessons Learned

Tech Stack

Project Structure

Setup and Run

Prerequisites

Installation

Configuration

Build Database

Run the App

Run Evaluation

Future Work

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages