GitHub - M-HuangX/Firn: Auditable AI agent for financial analysis — 3-phase audit pipeline with deterministic verdicts, multi-agent equity research, and versioned research memory.

Every claim traced. Every source verified. Every verdict deterministic.

The Problem · How Firn Audits · The Pipeline · Architecture · Web UI · Quick Start · audit-fence ↗

Want auditable AI in your own system? The enforcement pattern behind Firn has been extracted into audit-fence — a standalone Python library for automated hallucination detection. Wraps around any agent pipeline, captures production traces, and runs a post-hoc audit agent that programmatically verifies every citation against real source data. Model-agnostic, framework-compatible, MIT licensed. pip install audit-fence.

The Problem

AI agents are increasingly used to generate financial reports that humans act on. But every LLM output carries a fundamental risk: hallucination. A report that states "revenue grew 26% YoY" sounds authoritative — but did the model compute that from real data, or confabulate it?

You can't assign a human to audit every AI output — it's too expensive and defeats the purpose of automation. So most systems let the AI cite its own sources: tools like Deep Research, Gemini, and ChatGPT generate a report and attach their own citations in the same pass. The problem: the same model that might hallucinate the content is also providing the citations. When it says "Source: get_income_statement → $5.1B", you're trusting the model's claim about what it retrieved — not a verified record of what it actually retrieved.

A better approach: use an independent audit agent to verify the report after the fact. But this just moves the problem — the audit agent is itself an LLM, and it can fabricate evidence just as easily. Saying "I checked and it's correct" means nothing if the checking itself is unverifiable.

Firn solves this with programmatic enforcement: the audit agent searches for evidence, but every piece of evidence it attempts to record is machine-verified against its actual search history before being accepted. The agent cannot assert "I found this" — the system checks whether it actually searched for it, whether it quoted the report verbatim (not a paraphrase), and whether the source document actually contains what the agent claims. Every acceptance and rejection is logged — a compliance officer can review not just what passed, but what failed and why.

How Firn Audits

Firn is a multi-agent system: 4 specialist agents collect market data in parallel (fundamentals, technicals, valuation, macro), a core agent synthesizes them into a research report, and then an independent audit pipeline verifies every factual claim in that report.

The audit doesn't ask "is this report good?" — it asks, for each specific claim: "where exactly did this number come from, and does the source actually say that?"

An LLM audit agent searches for evidence, but it can only record what it actually found — the system rejects any evidence that doesn't match the agent's real search results. Three roles, strictly separated: the LLM searches, code enforces, deterministic logic judges.

What makes this different

1. The auditor is itself audited — by code

This is the core idea. An audit agent that can freely assert "I checked and it's correct" is no better than the agent it audits — it can hallucinate evidence just as easily. Firn solves this by constraining the audit agent at the tool level: every piece of evidence must pass three programmatic verification layers before being accepted.

Layer	What it checks	How
Grep Evidence Verification	"Did you actually search for this, or are you quoting from memory?"	Every `record_()` call is cross-checked against the agent's real `grep_trace()` history. If the evidence text doesn't match any recent grep result → rejected*.
Report Text Verification	"Does the sentence you're pointing to actually exist in the report?"	`claim_in_report` must be a normalized substring of the actual report. Paraphrases → rejected.
Specialist Output Verification	"Did the specialist actually say this?"	During the specialist fidelity phase, every claim attributed to a specialist is verified against the specialist's actual output file. Fabrications → rejected.

This design also addresses an LLM architecture constraint: when the report, source data, and reasoning traces are all loaded into one context window, information in the middle is most prone to hallucination. By requiring a fresh grep search before each evidence submission, the system ensures the relevant evidence sits at the tail of the context window — where attention is strongest. The enforcement is not just a policy check; it structurally reduces the conditions under which hallucination occurs.

All rejections are logged to enforcement_log.jsonl — a compliance officer can inspect not just what was accepted, but what was rejected and why.

Hallucination is inevitable — even humans have it. But through programmatic enforcement, we can dramatically reduce it while providing the auditability and traceability that regulated industries need.

Use this in your own pipeline. The enforcement, trace capture, and audit agent have been extracted into audit-fence — a standalone library that wraps around any agent system. It captures tool call traces from your production pipeline, then runs a separate audit agent that verifies every claim against the real source data. No framework lock-in, no Firn dependency required.

2. Full chain of custody

Every citation in the final output includes the complete provenance chain:

Report: "trailing P/E of 18.9x"
  └─ Specialist (fundamental): "P/E Ratio: 18.9x (trailing)"
       └─ Tool call #1: get_stock_info(AAPL)
            └─ Raw API response: { "trailingPE": 18.923 }

This is not metadata — it's grep-verified evidence at every link. The chain terminates at raw API responses (yfinance, FRED, SEC filings) — that is the trust boundary. Firn verifies that the report faithfully represents what the data sources actually returned, not whether the data sources themselves are correct. Every layer above the raw data is auditable; the raw data is the ground truth.

3. Deterministic verdicts, not LLM opinions

The audit agent collects evidence. A separate program (verdict.py) assigns trust levels using deterministic rule-based logic — no LLM involved in the verdict decision. This means the verdict layer itself cannot hallucinate: given the same collected evidence, the same verdicts are produced every time. The LLM's role is strictly limited to finding evidence; judging it is done by code.

4. Purpose-built search for financial data

The audit agent searches evidence using grep_trace — a ripgrep-powered search tool inspired by Claude Code's grep implementation. Financial data appears in wildly different formats across the evidence chain: a report says "$5.1B", the specialist wrote "revenue $5.1B (+26.2% YoY)", and the raw API returned {"totalRevenue": 5098000000}. The tool supports full regex with OR alternatives ("5.1|5098|5100") so the agent can bridge these format gaps in a single query.

Each grep result is automatically annotated with the originating tool call ([@ tool_call #3: get_income_statement]), giving the agent immediate provenance without manual cross-referencing. And every grep call is recorded — when the agent attempts to submit evidence, the system verifies it against actual grep history, closing the loop on fabrication.

The Audit Pipeline

The audit runs in three phases. Each phase has a clear purpose, constrained scope, and machine-verified outputs.

Phase 1 — Specialist Fidelity: "Did each specialist faithfully report its data?"

Four parallel audit agents, one per specialist (fundamental, technical, value, macro). Each agent:

Reads the specialist's output (trace/specialist_outputs/fundamental_output.md)
For every factual claim, searches the specialist's raw tool data (tools/fundamental_tool_calls.json)
Records a fidelity verdict: found (grep matched), derived (inputs present, arithmetic by LLM), or not-found

Example — Specialist Fidelity Check:

  Specialist output:  "Current ratio improved to 1.47x from 1.32x"

  Agent searches:     grep_trace("1.47", "tools/fundamental_tool_calls.json")
  Grep result:        line 186: "currentRatio": 1.4692   [@ tool_call #3: get_financial_metrics]

  Agent searches:     grep_trace("1.32", "tools/fundamental_tool_calls.json")
  Grep result:        line 204: "currentRatio": 1.3218   [@ tool_call #3: get_financial_metrics]

  Verdict:            FOUND — both values trace to get_financial_metrics, tool call #3

Enforcement: The agent must paste actual grep output into grep_evidence. The tool programmatically verifies this matches the agent's grep history — fabricated evidence is rejected.

Output: specialist_citations/{agent}.jsonl — one entry per claim, with grep coordinates.

Phase 2a — Report-to-Specialist Tracing: "Can each report claim be traced to a specialist?"

A single agent reads the final report and searches specialist outputs:

Identifies every factual claim in the report (numbers, dates, metrics, comparisons)
For each claim, greps all specialist outputs to find a match
Records the specialist excerpt and grep coordinates

Scope constraint: This agent can only search trace/specialist_outputs/ — it cannot access raw tool data. This forces a clean separation of evidence paths.

Example — Report-to-Specialist Trace:

  Report says:        "Revenue grew 26% YoY to $5.1B"

  Agent searches:     grep_trace("26%|5.1", "trace/specialist_outputs/")
  Match found in:     fundamental_output.md, line 51
  Specialist wrote:   "FY2025 revenue $5.1B (+26.2% YoY)"

  Record:             claim_in_report = "Revenue grew 26% YoY to $5.1B"  (exact substring from report)
                      specialist_excerpt = "FY2025 revenue $5.1B (+26.2% YoY)"  (exact substring from output)

Phase 2b — Report-to-Source Tracing: "Does the raw data actually contain this?"

A separate agent (running in parallel with 2a) reads the same report but searches raw tool data:

For each factual claim, greps all tool call JSON files
Handles number format variations (report: "$5.1B" → raw: 5100000000)
Records the raw value, source tool, and grep coordinates

Scope constraint: This agent can only search tools/ — it cannot access specialist outputs. Two independent evidence paths, zero cross-contamination.

Example — Report-to-Source Trace:

  Report says:        "Revenue grew 26% YoY to $5.1B"

  Agent searches:     grep_trace("5.1|5100", "tools/")
  Match found in:     tools/fundamental_tool_calls.json, line 42
  Raw data:           "totalRevenue": 5098000000   [@ tool_call #2: get_income_statement]

  Record:             raw_value = "5098000000"
                      source_tool = "get_income_statement"  (auto-resolved from grep coordinates)

Verdict Merge — Deterministic, No LLM

After Phase 2a and 2b complete, verdict.py merges their outputs using program logic:

# Simplified verdict logic — no LLM, no temperature, no sampling

# 1. Special source types take priority
if source_type == "kb":
    verdict = "kb-sourced"            # Traced to knowledge base
elif source_type == "web":
    verdict = "web-sourced"           # Traced to web search
elif source_type in ("computation", "derived"):
    verdict = "computed"              # Deterministic calculation (e.g., reverse DCF)

# 2. Combined evidence rules
elif has_specialist and r1_fidelity == "found":
    verdict = "verified"              # R2a + R1: report → specialist → raw data
elif has_source or has_specialist:
    if has_specialist and not has_source and r1_fidelity != "found":
        verdict = "specialist-judgment"  # Specialist claimed it, weak data backing
    else:
        verdict = "supported"         # Partial evidence chain present
else:
    verdict = "unverified"            # No evidence found anywhere

Six Trust Levels

Verdict	Meaning	Trust	Example
Verified	Full chain via R2a + R1: report → specialist → raw data	Highest	"P/E of 18.9x" — specialist stated it, R1 confirmed it traces to raw API data
Supported	Partial evidence chain — raw data or specialist trace present	High	"Revenue $5.1B" — found in raw data (R2b), but specialist chain incomplete
KB-Sourced	Traced to knowledge base entry	High	"Uranium supply squeeze" — from KB themes/uranium-supply
Computed	Result of deterministic calculation	Medium-High	"Implied growth 9.2%" — from reverse DCF sidecar
Web-Sourced	Traced to web search result	Medium	"CEO appointed May 2026" — from web_search result
Specialist Judgment	Only specialist stated it, no raw data	Medium	"Fair value ~$150" — specialist's DCF estimate
Unverified	No evidence found in any source	None	Claim exists in report but cannot be traced

What you actually get

The verdict label is a summary — the real output is a structured citation per claim containing the complete evidence package.

Audited report with evidence trace — every claim linked to its source data

In the web UI, citations appear as inline highlights on the report text. Hover any highlighted number to see its full evidence chain:

  Report text (with audit overlay active):

  "...Alphabet trades at a trailing P/E of 18.9x⁷, near its
  5-year median of 24.2x⁸. Free cash flow reached $62.25⁹..."
                                            ─────
                                              ↓ hover
                                ┌─────────────────────────┐
                                │ ✓ Verified               │
                                │                          │
                                │ "trailing P/E of 18.9x"  │
                                │                          │
                                │ Source: get_stock_info    │
                                │ = 18.923                 │
                                │                          │
                                │ Specialist (fundamental): │
                                │ "P/E Ratio: 18.9x        │
                                │  (trailing)"             │
                                └─────────────────────────┘

Each highlighted number links to a structured citation. Here is the underlying data for citation ⁷ above:

{
  "id": 7,
  "claim": "Trailing P/E of 18.9x",
  "claim_in_report": "trailing P/E of 18.9x",
  "verdict": "verified",
  "source": {
    "agent": "fundamental",
    "tool": "get_stock_info",
    "index": 1,
    "raw_value": "18.923"
  },
  "specialist": {
    "agent": "fundamental",
    "excerpt": "P/E Ratio: 18.9x (trailing)"
  },
  "evidence": {
    "source_grep": "tools/fundamental_tool_calls.json:42: \"trailingPE\": 18.923  [@ tool_call #1: get_stock_info]",
    "specialist_grep": "trace/specialist_outputs/fundamental_output.md:15: P/E Ratio: 18.9x (trailing)"
  },
  "r1_match": {
    "agent": "fundamental",
    "claim_id": 3,
    "verdict": "found",
    "source_tool": "get_stock_info",
    "source_index": 1
  }
}

Every claim gets this treatment — not just a color label, but the exact specialist quote, the exact raw API value, the grep coordinates where evidence was found, and the R1 cross-reference confirming the specialist faithfully reported its data. A compliance officer can trace any number in the report back to the API response that produced it.

What the Audit Agent Sees

Transparency matters. Here is exactly what the audit agent has access to — and what it doesn't.

Trace directory structure (generated per analysis)

logs/{execution_id}/
├── reports/
│   └── final_report.md              ← The report being audited
├── trace/
│   ├── specialist_outputs/
│   │   ├── fundamental_output.md     ← What each specialist wrote
│   │   ├── technical_output.md
│   │   ├── value_output.md
│   │   └── macro_output.md
│   ├── react_steps/                  ← Full reasoning chains (think → tool → observe)
│   │   ├── fundamental_steps.jsonl
│   │   └── core_analysis_steps.jsonl
│   └── prompts/                      ← Exact prompts sent to each agent
│       ├── fundamental_system.txt
│       └── core_analysis_user.txt
├── tools/
│   ├── fundamental_tool_calls.json   ← Raw API responses (every tool call, full I/O)
│   ├── technical_tool_calls.json
│   ├── value_tool_calls.json
│   ├── macro_tool_calls.json
│   └── core_analysis_tool_calls.json ← Core agent's KB reads, web searches
└── audit/                            ← Audit outputs (written during audit)
    ├── citations.json                ← Final structured citations for UI
    ├── enforcement_log.jsonl         ← All rejected evidence attempts
    ├── specialist_citations/         ← Phase 1 outputs
    ├── specialist_evidence.jsonl     ← Phase 2a outputs
    └── source_evidence.jsonl         ← Phase 2b outputs

Scope constraints per phase

Phase	Can search	Can read	Cannot access
Phase 1 (per specialist)	That specialist's `tools/*.json`	That specialist's output	Other specialists' data
Phase 2a	`trace/specialist_outputs/` only	`report.md` only	`tools/` (raw data)
Phase 2b	`tools/` only	`report.md` only	`trace/specialist_outputs/`

This separation ensures Phase 2a and 2b provide independent evidence paths — their outputs are only combined by deterministic program logic in the verdict merge.

System Architecture

Analysis Pipeline

User: "Analyze AAPL"
    │
    ├─→ Fundamental Agent ──→ earnings, revenue, cash flow, balance sheet
    ├─→ Technical Agent   ──→ price trends, indicators, support/resistance
    ├─→ Value Agent       ──→ valuation multiples, DCF, peer comparison
    └─→ Macro Agent       ──→ rates, inflation, market regime, yield curve
                                │
                    ┌───────────┘
                    ▼
              Core Agent (Firn)
              + Knowledge Base context
              + Web search capability
                    │
                    ▼
            Comprehensive Report
                    │
                    ▼
          ┌─── Audit Pipeline ───┐
          │ Phase 1: Fidelity    │
          │ Phase 2a: Specialist │
          │ Phase 2b: Source     │
          │ Merge: Verdicts      │
          └──────────────────────┘
                    │
                    ▼
          Audited Report + Citations

Versioned Research Memory

Firn maintains a persistent, auditable knowledge base — every belief update is sourced, versioned, and reviewable.

New information is read, compressed into themes, and recorded as auditable revisions to the agent's long-term structured memory. The knowledge base maintains separate sections for the agent's own views, user-provided views, and points of divergence between the two — a structural defense against LLM sycophancy that keeps analysis grounded in accumulated evidence rather than user expectations.

After 5 months and 400+ articles across 96 training epochs, the knowledge base represents a compressed, structured understanding of markets. The mind is updated, not overwritten — every revision is traceable to its source.

The Glacier Metaphor

The name Firn comes from glaciology — it's the intermediate stage between fresh snow and glacier ice, where each year's snowfall compresses into a denser, more permanent layer.

Concept	Firn Mapping
Fresh snow	New data — articles, market feeds, macro indicators
Compression into firn	Digest processing — reading, evaluating, synthesizing
Ice layers (strata)	Knowledge base — structured, persistent, layered by time
Drilling an ice core	Analysis — penetrating through accumulated knowledge
Air bubbles in ice	Audit trail — each claim permanently traceable to its source
Reading the core	Verification — inspecting what was sealed at each layer

Roadmap

Web UI

The included web-ui/ is a fully functional Next.js interface that connects to the backend API. It features:

Analysis Theater — real-time React Flow DAG visualization of the multi-agent pipeline
Digest Theater — immersive 3-zone view of knowledge accumulation (reading stack, Firn presence, knowledge strata)
KB Explorer — browse the agent's notebook, themes, stock notes, and core worldview
Audit Citations — view per-claim verdicts overlaid on the analysis report

Firn UI v2 (Coming Soon)

An editorial-grade redesign is in active development — glacier-inspired visual language with light backgrounds, serif headlines, and generous whitespace. Designed as a reading experience rather than a dashboard.

Homepage — glacier photography, editorial typography, analysis input

Audited research note — inline citation markers linked to evidence chain

Archive — sealed core samples with claim counts, verification stats, and immutable records

Tech Stack

Layer	Technology	Details
Agent Framework	LangGraph	Multi-agent orchestration, parallel fan-out/fan-in, ReAct loops
Data Tools	MCP (Model Context Protocol)	27 tools across 10 modules — stdio transport, TTL cache
Data Sources	yfinance, FRED, StockTwits, Reddit, Tavily, + pluggable scrapers	Market data, macro, sentiment, analyst research, financial blogs
LLM	Multi-provider	DeepSeek, Gemini, Claude — provider-agnostic design
API	FastAPI	SSE streaming, cookie auth, execution traces
Knowledge Base	File-based (Markdown + YAML)	Persistent, git-trackable, human-readable
Frontend	Next.js 16, React 19, Tailwind v4	Editorial glacier aesthetic (in development)
Testing	pytest	1,000+ tests (880 agent + 138 MCP)

By the Numbers

Metric	Value
Agent tools	47 (27 MCP + 10 KB + 2 Web + 8 Audit)
Automated tests	1,000+
Audit trust levels	6 (deterministic classification)
Specialist agents	4 (parallel execution)
Knowledge articles	400+ (across 96 training epochs, 5 months)
Audit verification rate	91.4% verified+supported (GOOG benchmark)
Agent code	~20,000 lines Python
MCP server code	~7,300 lines Python

Quick Start

Prerequisites

Python 3.12+
uv (Python package manager)
Node.js 20+ and pnpm (for frontend)
API keys: at minimum one LLM provider (DeepSeek/Gemini/Anthropic) + FRED

Setup

# Clone the repository
git clone https://github.com/M-HuangX/Firn.git
cd firn

# Install agent dependencies
cd global-market-agent
uv sync

# Configure environment
cp .env.example .env
# Edit .env with your API keys (see .env.example for all options)

# Install MCP server dependencies
cd ../global-market-mcp
uv sync

# Install frontend dependencies
cd ../web-ui
pnpm install

Run an Analysis

cd global-market-agent

# Analyze a stock
uv run python -m src --ticker AAPL

# Analyze with automatic audit
uv run python -m src --ticker AAPL --with-audit

# Audit a previous analysis
uv run python -m src --audit latest

# Run the digest pipeline (process new articles)
uv run python -m src --digest

# Check system status
uv run python -m src --status

Start the Web UI

# Terminal 1: Start the API server
cd global-market-agent
uv run uvicorn src.api.app:app --host 0.0.0.0 --port 8000

# Terminal 2: Start the frontend
cd web-ui
pnpm dev
# Open http://localhost:3000

Run Tests

# Agent tests (880+)
cd global-market-agent && uv run pytest

# MCP server tests (138+)
cd global-market-mcp && uv run pytest

# Frontend tests
cd web-ui && npx vitest run

Project Structure

firn/
├── global-market-agent/          # LangGraph multi-agent system
│   ├── src/
│   │   ├── agents/               # 4 specialists + core agent + filter
│   │   ├── audit/                # 3-phase audit pipeline (the core innovation)
│   │   ├── knowledge_base/       # Persistent KB with 10 tools
│   │   ├── sources/              # Data ingestion (pluggable scrapers)
│   │   ├── api/                  # FastAPI + SSE streaming
│   │   └── utils/                # Execution logger, context manager
│   └── tests/                    # 880+ tests
├── global-market-mcp/            # MCP data server
│   ├── src/                      # 27 tools across 10 modules
│   └── tests/                    # 138+ tests
└── web-ui/                       # Next.js frontend (functional + v2 redesign in progress)

License

This project is licensed under the GNU Affero General Public License v3.0 — you are free to use, modify, and distribute this software. If you deploy a modified version as a network service, you must make the source code available under the same license.

For commercial licensing inquiries, please contact xin.huang@epfl.ch.

Named after firn — compacted snow on its way to becoming glacier ice.
Each layer preserves what fell. Each bubble seals the air of its time.
Firn builds knowledge the same way: layer by layer, verifiable to the core.

Name		Name	Last commit message	Last commit date
Latest commit History 17 Commits
docs/images		docs/images
global-market-agent		global-market-agent
global-market-mcp		global-market-mcp
web-ui		web-ui
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md

Folders and files

Latest commit

History

Repository files navigation

The Problem

How Firn Audits

What makes this different

The Audit Pipeline

Phase 1 — Specialist Fidelity: "Did each specialist faithfully report its data?"

Phase 2a — Report-to-Specialist Tracing: "Can each report claim be traced to a specialist?"

Phase 2b — Report-to-Source Tracing: "Does the raw data actually contain this?"

Verdict Merge — Deterministic, No LLM

Six Trust Levels

What you actually get

What the Audit Agent Sees

Trace directory structure (generated per analysis)

Scope constraints per phase

System Architecture

Analysis Pipeline

Versioned Research Memory

The Glacier Metaphor

Roadmap

Web UI

Firn UI v2 (Coming Soon)

Tech Stack

By the Numbers

Quick Start

Prerequisites

Setup

Run an Analysis

Start the Web UI

Run Tests

Project Structure

License

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages