
diegocolpal/multi-agent-forecast-assistant

 
 

Multi-Agent Forecast Assistant

A multi-agent AI system built with LangGraph that answers natural-language questions about demand forecasting, bridging the gap between ML model outputs and business understanding.

Demo

[Image: Demo]

Why This Exists

Data science teams spend a disproportionate amount of time explaining model outputs to stakeholders, answering the same questions about methodology, performance, and forecast rationale. This assistant eliminates that bottleneck. Business users ask questions in plain language and get answers with supporting data and charts in seconds, not days. The result: faster decisions, higher forecast adoption, and data scientists freed to build rather than explain.

Goes beyond dashboards. A Tableau dashboard can show what the forecast is. It cannot explain why the model chose that number, how it compares to alternatives, or answer a follow-up question the dashboard designer didn't anticipate. Dashboards are static. They answer the questions you thought to ask at design time. This assistant answers the questions that come up in the moment: the ad-hoc, unstructured, context-dependent questions that actually block decisions. No filters to click, no tabs to navigate, no training required. Just ask. And when a new model is added to the system, the assistant handles it immediately without anyone rebuilding dashboard views.

Promotes adoption of data science products. A risk to any ML project is that nobody uses it. Stakeholders override forecasts they don't understand. This assistant gives them a way to interrogate the system: Why was LightGBM chosen over Prophet? How did it perform on Category X? Show me the backtest results. When people can ask questions and get clear answers, they trust the output. When they trust the output, they use it, and the ROI of the entire forecasting investment multiplies.

How It Works

A supervisor agent interprets each question and routes it to the right specialist:

  • RAG Agent: answers methodology and concept questions (e.g., "How does walk-forward backtesting work?") using hybrid retrieval (BM25 + semantic search) over 6 knowledge-base documents
  • SQL Agent: answers data and comparison questions (e.g., "Which model has the lowest RMSE?") by writing SQL queries, running calculations, and generating charts

Three layers of guardrails validate input, SQL execution, and output. These check for PII, prompt injection, off-topic queries, SQL injection, hallucination, and low-confidence responses.
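
As an illustration of the input layer, the sketch below screens a question with regular expressions only, so no LLM call is needed for these two checks. The patterns and the `check_input` function are hypothetical examples, not the project's actual rules (which the README says live in config.yaml).

```python
import re

# Hypothetical patterns for illustration only; real deployments need broader coverage.
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}
INJECTION_PATTERNS = [
    re.compile(r"ignore (all |any )?(previous|prior|above) instructions", re.IGNORECASE),
    re.compile(r"reveal (your|the) (system )?prompt", re.IGNORECASE),
]


def check_input(question: str) -> tuple:
    """Return (allowed, reason) using regex checks only."""
    for label, pattern in PII_PATTERNS.items():
        if pattern.search(question):
            return False, f"pii:{label}"
    for pattern in INJECTION_PATTERNS:
        if pattern.search(question):
            return False, "prompt_injection"
    return True, "ok"
```

A caller would run `check_input` before handing the question to the supervisor and return a refusal message whenever the first element is `False`.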

Architecture

[Image: Architecture]

Evaluation Results

The pass rate improved from 65% to 100% over 6 iterations.

Early evaluation identified 5 RAG failure types: vocabulary mismatch, wrong document retrieved, context contamination, incomplete retrieval, and missing methodology. These were resolved through hybrid retrieval (BM25 + semantic), structure-aware chunking (MarkdownNodeParser), query rewriting with routing prefixes, and stronger prompts with database schema descriptions and SQL query examples.

29 test questions across 6 categories, scored 1 to 5 by an LLM-as-judge framework:

| Category | Questions | Avg Score | Pass Rate |
|---|---|---|---|
| Explicit Retrieval | 7 | 5.00 | 100% |
| Reasoning | 5 | 4.60 | 100% |
| Data / SQL | 7 | 4.86 | 100% |
| Visualization | 2 | 5.00 | 100% |
| Out-of-Scope | 4 | 4.50 | 100% |
| Adversarial | 4 | 4.75 | 100% |
| Overall | 29 | 4.72 | 100% |

Routing accuracy: 96% (25/26 routable questions correctly assigned).
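
A per-category summary like the one above can be produced by grouping judge scores and counting passes. The sketch below is a minimal version of that aggregation. `PASS_THRESHOLD`, the `summarize` function, and the sample scores are assumptions for illustration; the README does not state the real pass cut-off or show the evaluation code.

```python
from collections import defaultdict
from statistics import mean

PASS_THRESHOLD = 3  # assumed cut-off; the README does not state the real one


def summarize(results):
    """Group (category, score) pairs and report count, average, and pass rate."""
    by_category = defaultdict(list)
    for category, score in results:
        by_category[category].append(score)
    return {
        category: {
            "questions": len(scores),
            "avg_score": round(mean(scores), 2),
            "pass_rate": sum(s >= PASS_THRESHOLD for s in scores) / len(scores),
        }
        for category, scores in by_category.items()
    }


# Made-up scores, not the project's evaluation data.
sample = [("Reasoning", 5), ("Reasoning", 4), ("Adversarial", 2), ("Adversarial", 5)]
report = summarize(sample)
```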

What Makes This Project Different

  • Real forecasting domain: Built on M5 competition data (Walmart demand forecasting with 9 ML/DL models), not generic documents or toy examples.
  • Multi-agent with 4 specialized tools: Goes beyond simple RAG with rag_tool for knowledge retrieval, sql_tool for querying forecast results, computation_tool for arithmetic on metrics, and viz_tool for generating bar and line charts.
  • 3-layer production guardrails: Input validation (PII, prompt injection, off-topic), tool-level safety (SQL injection prevention), and output checks (hallucination, confidence, attribution).
  • LLM-as-judge evaluation: 29 test questions across 6 categories (explicit retrieval, reasoning, data/SQL, visualization, out-of-scope, adversarial) with 100% pass rate.
  • Speed-optimized inference: Conditional LLM calls reduce round-trips from 5-6 to 2-3 per question. Regex replaces the LLM for injection detection, the hallucination check is skipped when agent attribution exists, and rule-based query rewriting handles clear questions without an LLM call.
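
To make the tool-level safety layer concrete, here is one possible shape for a SQL guardrail: accept a single read-only SELECT and reject everything else before it reaches SQLite. The `is_safe_select` and `run_query` helpers, the keyword list, and the `metrics` table are hypothetical; the project's actual checks may differ.

```python
import re
import sqlite3

# Coarse keyword filter (an assumption): it also rejects harmless uses such as
# the SQL replace() function, trading some flexibility for safety.
FORBIDDEN = re.compile(
    r"\b(insert|update|delete|drop|alter|create|attach|pragma|replace)\b", re.IGNORECASE
)


def is_safe_select(sql: str) -> bool:
    """Accept a single read-only SELECT statement and nothing else."""
    statement = sql.strip().rstrip(";").strip()
    if ";" in statement:  # more than one statement (or a literal containing ';')
        return False
    if not statement.lower().startswith("select"):
        return False
    return FORBIDDEN.search(statement) is None


def run_query(conn: sqlite3.Connection, sql: str) -> list:
    """Execute the query only if it passes the guardrail."""
    if not is_safe_select(sql):
        raise ValueError("query rejected by SQL guardrail")
    return conn.execute(sql).fetchall()


# Hypothetical table and values, only to exercise the check.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE metrics (model TEXT, rmse REAL)")
conn.executemany("INSERT INTO metrics VALUES (?, ?)", [("model_a", 2.1), ("model_b", 1.8)])
best = run_query(conn, "SELECT model FROM metrics ORDER BY rmse LIMIT 1;")
```

Opening the database in read-only mode would be a complementary safeguard, since a keyword filter alone is easy to get wrong.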

Design Decisions

| Decision | Why |
|---|---|
| LangGraph over basic LangChain agents | Explicit state graph gives control over routing, memory, and tool execution. The supervisor pattern cleanly separates concerns between specialist agents. |
| Hybrid retrieval (BM25 + semantic) | Semantic search alone misses exact keyword matches (e.g., model names in results tables). BM25 catches these. Combined retrieval fixed 3 of 5 identified RAG failure types. |
| MarkdownNodeParser for chunking | LlamaIndex's markdown-aware parser chunks by document structure (headers/sections), eliminating context contamination and incomplete retrieval caused by naive fixed-size chunking. |
| Direct agent response extraction | The supervisor tends to paraphrase or swallow agent answers. Extracting responses directly from the message history preserves the agent's exact output. |
| Conditional LLM calls | Not every query needs every check. Regex handles injection detection; the hallucination check is skipped when agent attribution exists; query rewriting uses rules for clear questions. This roughly halves response time. |
| 3 separate LLM models | Supervisor, RAG agent, and SQL agent each use a different model optimized for their role (routing, retrieval synthesis, code generation). |
| Externalized prompts and config | All 8 LLM prompts live in prompts.py and all settings in config.yaml, so tuning doesn't require touching application logic. |

Lessons Learned

Semantic search alone is insufficient for RAG: Embedding similarity missed keyword matches in structured content (e.g., results tables listing model names), causing vocabulary mismatch, wrong-document retrieval, and missing methodology. Adding BM25 as a complement fixed this: semantic search finds conceptually similar content while BM25 catches exact term matches. Combined, they cover each other's blind spots.
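
One common way to combine a keyword ranking with a semantic ranking is reciprocal rank fusion (RRF). The README does not say how this project merges the two result lists, so the sketch below is an assumption that only illustrates the idea; the chunk IDs are invented.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked ID lists; each list adds 1 / (k + rank) to a document's score."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical chunk IDs: one ranking per retriever.
bm25_hits = ["results_table", "model_overview", "backtesting"]
semantic_hits = ["backtesting", "results_table", "metrics_glossary"]
fused = reciprocal_rank_fusion([bm25_hits, semantic_hits])
```

Chunks that both retrievers rank highly rise to the top, while a chunk found by only one retriever still survives in the fused list.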

Naive chunking causes data quality issues: Fixed-character-count chunking merged unrelated sections into one chunk (context contamination) and split related content across chunks (incomplete retrieval). Switching to MarkdownNodeParser, which chunks by document headers and sections, eliminated both problems.
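
The idea behind structure-aware chunking can be shown without LlamaIndex. The function below is a simplified stand-in for MarkdownNodeParser, not its implementation: it starts a new chunk at every Markdown header so that no chunk mixes two sections. It ignores edge cases such as `#` characters inside fenced code.

```python
import re

HEADER = re.compile(r"^#{1,6}\s+(.*)$")


def chunk_by_headers(markdown: str) -> list:
    """Split Markdown so each chunk holds one header and the text beneath it."""
    chunks = []
    title, body = "", []

    def flush():
        # Emit the current section unless it is completely empty.
        if title or any(part.strip() for part in body):
            chunks.append({"title": title, "text": "\n".join(body).strip()})

    for line in markdown.splitlines():
        match = HEADER.match(line)
        if match:
            flush()
            title, body = match.group(1).strip(), []
        else:
            body.append(line)
    flush()
    return chunks


doc = "# Backtesting\nWalk-forward splits.\n\n## Metrics\nRMSE and MAE.\n"
chunks = chunk_by_headers(doc)
```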

Supervisors paraphrase agent responses: The LangGraph supervisor rewrote or truncated specialist answers before returning them. The fix was architectural, not prompt-based: walking the message history in reverse and extracting the actual agent response directly.
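
A reverse walk over the message history might look like the sketch below. The `Message` dataclass is a stand-in for LangChain message objects, and the agent names are hypothetical; only the technique (scan from the end, return the last specialist answer) comes from the text above.

```python
from dataclasses import dataclass


@dataclass
class Message:
    """Stand-in for a LangChain message; `name` identifies the sender."""
    role: str  # "human", "ai", or "tool"
    content: str
    name: str = ""


AGENT_NAMES = {"rag_agent", "sql_agent"}  # hypothetical agent names


def extract_agent_response(messages: list) -> str:
    """Return the most recent non-empty answer written by a specialist agent."""
    for message in reversed(messages):
        if message.role == "ai" and message.name in AGENT_NAMES and message.content.strip():
            return message.content
    # Fall back to the final message if no specialist answer is found.
    return messages[-1].content if messages else ""
```

Because the supervisor's own message usually comes last, scanning in reverse and filtering by sender skips the paraphrase and lands on the specialist's original wording.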

Ambiguous questions cause routing errors: Routing accuracy was 75% before adding query rewriting. Prefixing rewritten queries with [KNOWLEDGE BASE QUESTION] or [DATA QUESTION] improved accuracy to 96%.
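
A rule-based rewriter that adds these prefixes could be as small as the sketch below. The keyword lists and the `rewrite_query` function are illustrative assumptions; the two prefix strings are the ones quoted above. Returning `None` marks a question as ambiguous so that a caller can fall back to an LLM rewrite.

```python
import re
from typing import Optional

# Hypothetical keyword rules, chosen only for this example.
DATA_HINTS = re.compile(
    r"\b(lowest|highest|rmse|mae|compare|plot|chart|how many)\b", re.IGNORECASE
)
KNOWLEDGE_HINTS = re.compile(
    r"\b(how does|what is|why|explain|methodology)\b", re.IGNORECASE
)


def rewrite_query(question: str) -> Optional[str]:
    """Prefix clear questions with a routing tag; return None when unclear."""
    data = bool(DATA_HINTS.search(question))
    knowledge = bool(KNOWLEDGE_HINTS.search(question))
    if data and not knowledge:
        return f"[DATA QUESTION] {question}"
    if knowledge and not data:
        return f"[KNOWLEDGE BASE QUESTION] {question}"
    return None  # ambiguous or no match: defer to an LLM-based rewrite
```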

LLM calls are the main latency bottleneck: Each LLM guardrail check adds a full API round-trip. Auditing every call and replacing with regex or rules where possible cut total calls from 5-6 to 2-3 per question, roughly halving response time.
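
The skip-when-attributed rule can be expressed as a small branch in the output guardrail. In this sketch, `AgentAnswer`, `finalize`, and the fallback message are hypothetical names; the LLM check is passed in as a callable so the example runs without any API access.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class AgentAnswer:
    text: str
    sources: List[str]  # documents or SQL queries the agent cited; empty if none


def finalize(
    answer: AgentAnswer, llm_hallucination_check: Callable[[str], bool]
) -> Tuple[str, int]:
    """Return (final_text, llm_calls_made) for the output guardrail step."""
    if answer.sources:
        # Attribution exists, so the extra LLM round-trip is skipped.
        return answer.text, 0
    grounded = llm_hallucination_check(answer.text)
    if grounded:
        return answer.text, 1
    return "I could not verify that answer against the available data.", 1
```

Returning the call count alongside the text makes it easy to log how many round-trips each question actually cost.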

Extract chart paths from ToolMessages, not LLM text: LLMs inconsistently echo file paths in their responses. Scanning ToolMessage objects in the message history is reliable; relying on the agent's text output is not.
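
The sketch below shows the scanning approach with a plain dataclass standing in for LangChain's ToolMessage. The `.png` pattern and the example path are assumptions; the point is that only tool output is searched, never the model's prose.

```python
import re
from dataclasses import dataclass


@dataclass
class Message:
    """Stand-in for LangChain messages; role "tool" marks a ToolMessage."""
    role: str
    content: str


CHART_PATH = re.compile(r"[\w./\\-]+\.png")  # assumes charts are saved as PNG


def extract_chart_paths(messages: list) -> list:
    """Collect unique chart paths that appear in tool output, in order."""
    paths = []
    for message in messages:
        if message.role != "tool":
            continue  # ignore whatever the LLM wrote about file paths
        for path in CHART_PATH.findall(message.content):
            if path not in paths:
                paths.append(path)
    return paths
```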

Tech Stack

| Component | Technology |
|---|---|
| LLM | OpenAI GPT models (3 separate models for supervisor, RAG, SQL agents) |
| Orchestration | LangGraph + langgraph-supervisor + MemorySaver checkpoint |
| RAG | LlamaIndex MarkdownNodeParser + LangChain Chroma + BM25 |
| Agents | LangChain tools + LangChain agents |
| Database | SQLite (forecasts, metrics, hyperparameters) |
| Visualization | matplotlib |
| UI | Streamlit |
| Configuration | YAML + python-dotenv |

Project Structure

multi-agent-forecast-assistant/
├── config.yaml              # All settings: LLM models, guardrails, paths, patterns
├── requirements.txt
├── .env                     # API keys (not committed)
├── .gitignore
├── README.md
├── src/
│   ├── app.py               # Streamlit chat interface
│   ├── agents.py            # AgentFactory: supervisor + RAG/SQL agents
│   ├── tools.py             # AgentTools: rag, sql, computation, viz tools
│   ├── guardrails.py        # Input, tool, and output guardrail classes
│   ├── prompts.py           # All 8 LLM prompts centralized
│   ├── utils.py             # Query rewriting + response extraction
│   ├── config.py            # Loads config.yaml, initializes LLMs and logger
│   ├── build_database.py    # ETL: CSVs + JSON → SQLite
│   ├── evaluation.py        # LLM-as-judge evaluation framework
│   └── test_questions.py    # 29 test questions across 6 categories
├── knowledge_base/          # RAG source documents (6 Markdown files)
├── data/                    # Source CSVs + generated SQLite database
├── eval_results/            # Evaluation reports and scores
└── plots/                   # Generated charts + demo GIF

Setup and Run

Prerequisites

  • Python 3.11+
  • OpenAI API key

Installation

git clone https://github.com/yedanzhang-ai/multi-agent-forecast-assistant.git
cd multi-agent-forecast-assistant

python -m venv venv
source venv/Scripts/activate   # Windows (Git Bash)
# source venv/bin/activate     # macOS/Linux

pip install -r requirements.txt

Configuration

Create a .env file in the project root:

OPENAI_API_KEY=your-api-key-here

Build Database

python src/build_database.py

Run the App

streamlit run src/app.py

Run Evaluation

python src/evaluation.py

Future Work

  • Expand test coverage. Scale from 29 to 60+ questions with multi-agent edge cases (ambiguous routing, combined RAG+SQL queries, simultaneous guardrail triggers).
  • Streaming responses. Stream answers token-by-token instead of waiting for full completion.
  • Multi-threaded evaluation. Run test questions in parallel for faster feedback cycles.
  • More chart types. Add scatter plots, heatmaps, and multi-panel charts.
  • User authentication. Add login and rate limiting.
  • Probabilistic forecasts. Extend to prediction intervals, not just point forecasts.
  • SHAP explanations. Add a model interpretability tool for feature importance visualization.
