Config-driven Document Intelligence Platform on GCP
Turn document collections into searchable knowledge graphs — defined by one config file, deployed by one command.
The truth is in the documents.
Live Demo · Functional Spec · Roadmap · Example Config
Mulder transforms unstructured document collections — PDFs with complex layouts like magazines, newspapers, government correspondence — into structured, searchable knowledge.
You define your domain ontology in a single mulder.config.yaml. The pipeline adapts: extraction, entity resolution, retrieval, and analysis all derive from that one config file. No custom code per domain.
mulder.config.yaml → terraform apply → mulder pipeline run ./pdfs/ → mulder query "..."
| # | Capability | What it does |
|---|---|---|
| 1 | Layout Extraction | Document AI + Gemini Vision fallback for magazines, newspapers, multi-column layouts |
| 2 | Domain Ontology | One YAML defines entities, relationships, extraction rules. Gemini structured output with auto-generated JSON Schema. |
| 3 | Taxonomy | Auto-bootstrapped after ~25 docs, incremental growth, human-in-the-loop curation, cross-lingual |
| 4 | Hybrid Retrieval | Vector (pgvector) + BM25 (tsvector) + graph traversal (recursive CTEs), fused via RRF + LLM re-ranking |
| 5 | Web Grounding | Gemini verifies entities against live web data — coordinates, bios, org descriptions |
| 6 | Spatio-Temporal | PostGIS proximity queries, temporal clustering, pattern detection across time and space |
| 7 | Evidence Scoring | Corroboration scores, two-phase contradiction detection, source reliability (PageRank), evidence chains |
| 8 | Cross-Lingual Resolution | 3-tier entity resolution (attribute match, embedding similarity, LLM-assisted) across 100+ languages |
| 9 | Deduplication | MinHash/SimHash near-duplicate detection, dedup-aware corroboration scoring |
| 10 | Schema Evolution | Config-hash tracking per document per step, selective reprocessing after config changes |
| 11 | Visual Intelligence (v3.0 / Phase 2) | Image extraction, Gemini analysis, image embeddings, map/diagram data extraction |
| 12 | Pattern Discovery (v3.0 / Phase 2) | Cluster anomalies, temporal spikes, subgraph similarity, proactive insights |
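Row 4 fuses vector, keyword, and graph results with Reciprocal Rank Fusion (RRF). The snippet below is a minimal sketch of that fusion step, not Mulder's implementation: the function name, the sample passage ids, and the constant `k = 60` (a common default in the RRF literature) are all assumptions made for illustration.

```typescript
// Reciprocal Rank Fusion: score(d) = sum over result lists of 1 / (k + rank of d).
// k = 60 is a commonly used default, not a value taken from Mulder.
type RankedList = string[]; // passage ids, best match first

function reciprocalRankFusion(
  lists: RankedList[],
  k = 60,
): Array<{ id: string; score: number }> {
  const scores = new Map<string, number>();
  for (const list of lists) {
    list.forEach((id, index) => {
      const rank = index + 1; // ranks are 1-based
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  // Highest fused score first; ties are broken by id so the order is stable.
  return [...scores.entries()]
    .map(([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score || a.id.localeCompare(b.id));
}

// Hypothetical result lists from the three retrievers.
const vectorHits = ["p3", "p1", "p7"];
const keywordHits = ["p1", "p9", "p3"];
const graphHits = ["p1", "p3"];

const fused = reciprocalRankFusion([vectorHits, keywordHits, graphHits]);
console.log(fused.map((hit) => hit.id));
```

Because RRF only looks at ranks, the three retrievers do not need comparable score scales. The fused list would then be handed to the LLM re-ranking step.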
      PDF
       │
 ┌─────▼─────┐
 │  Ingest   │  Upload to Cloud Storage, pre-flight validation
 └─────┬─────┘
       │
 ┌─────▼─────┐
 │  Extract  │  Document AI + Gemini Vision fallback → layout JSON + page images → GCS
 └─────┬─────┘
       │
 ┌─────▼─────┐
 │  Segment  │  Gemini identifies stories from page images → Markdown + metadata → GCS
 └─────┬─────┘
       │
 ┌─────▼─────┐
 │  Enrich   │  Entity extraction, taxonomy normalization, cross-lingual resolution
 └─────┬─────┘
       │
 ┌─────▼─────┐
 │  Ground   │  Web enrichment via Gemini Search: coordinates, bios, verification
 └─────┬─────┘
       │
 ┌─────▼─────┐
 │   Embed   │  Semantic chunking + text-embedding-004 (768-dim) → pgvector + BM25
 └─────┬─────┘
       │
 ┌─────▼─────┐
 │   Graph   │  Deduplication, corroboration scoring, contradiction flagging
 └─────┬─────┘
       │
 ┌─────▼─────┐
 │  Analyze  │  Contradiction resolution, PageRank reliability, evidence chains
 └─────┬─────┘
       │
   Knowledge
     Graph
Every step is idempotent, independently runnable, and CLI-accessible. Content artifacts live in GCS, search index in PostgreSQL.
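One way to picture idempotent steps combined with config-hash tracking per document per step is sketched below. It is an illustration under stated assumptions, not Mulder's code: `runStep`, `configHash`, and the in-memory `processed` map are hypothetical names, and the map stands in for whatever PostgreSQL table holds the real pipeline state.

```typescript
import { createHash } from "node:crypto";

// Hypothetical state: the config hash that last produced each (document, step) result.
const processed = new Map<string, string>();

// Sort object keys recursively so that key order does not change the hash.
function canonicalize(value: unknown): unknown {
  if (Array.isArray(value)) return value.map(canonicalize);
  if (value !== null && typeof value === "object") {
    const record = value as Record<string, unknown>;
    return Object.fromEntries(
      Object.keys(record).sort().map((key) => [key, canonicalize(record[key])]),
    );
  }
  return value;
}

// Hash only the slice of the config that a given step depends on.
function configHash(slice: unknown): string {
  return createHash("sha256").update(JSON.stringify(canonicalize(slice))).digest("hex");
}

// Run the step only if this document has not been processed with this config yet.
function runStep(
  docId: string,
  step: string,
  slice: unknown,
  work: () => void,
): "ran" | "skipped" {
  const key = `${docId}:${step}`;
  const hash = configHash(slice);
  if (processed.get(key) === hash) return "skipped";
  work();
  processed.set(key, hash);
  return "ran";
}

let runs = 0;
const work = () => { runs += 1; };
const first = runStep("doc-1", "enrich", { entity_types: ["person", "event"] }, work);
const second = runStep("doc-1", "enrich", { entity_types: ["person", "event"] }, work);
const third = runStep("doc-1", "enrich", { entity_types: ["person", "event", "location"] }, work);
console.log(first, second, third, runs);
```

Re-running with an unchanged config is a no-op, while adding an entity type changes the hash and triggers reprocessing for that step only.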
All domain logic lives in mulder.config.yaml. Define your domain, the pipeline adapts:
project:
  name: investigative-journalism
ontology:
  entity_types:
    - name: person
      description: Individual mentioned in documents
      attributes:
        - { name: role, type: string }
        - { name: affiliation, type: string }
    - name: event
      description: A specific incident or occurrence
      attributes:
        - { name: date, type: date }
        - { name: location, type: string }
    - name: location
      description: Geographic place
      attributes:
        - { name: coordinates, type: geo_point, optional: true }
  relationships:
    - { name: involved_in, from: person, to: event }
    - { name: occurred_at, from: event, to: location }

Everything beyond project and ontology has sensible defaults. See mulder.config.example.yaml for the full reference.
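To make the idea of a JSON Schema derived from the ontology more concrete, here is a hedged sketch of how an entity type from the config could be turned into a schema for structured output. The type names, the `TYPE_MAP` mapping (including the `lat`/`lon` shape for `geo_point`), and `entitySchema` are assumptions for illustration. The source does not show how Mulder generates its schemas.

```typescript
// Attribute types as they appear in the example config above.
type AttributeType = "string" | "date" | "geo_point";

interface Attribute {
  name: string;
  type: AttributeType;
  optional?: boolean;
}

interface EntityType {
  name: string;
  description: string;
  attributes: Attribute[];
}

// Assumed mapping from config attribute types to JSON Schema fragments.
const TYPE_MAP: Record<AttributeType, object> = {
  string: { type: "string" },
  date: { type: "string", format: "date" },
  geo_point: {
    type: "object",
    properties: { lat: { type: "number" }, lon: { type: "number" } },
    required: ["lat", "lon"],
  },
};

// Build one JSON Schema object per entity type; non-optional attributes become required.
function entitySchema(entity: EntityType) {
  return {
    title: entity.name,
    description: entity.description,
    type: "object",
    properties: Object.fromEntries(
      entity.attributes.map((attr) => [attr.name, TYPE_MAP[attr.type]]),
    ),
    required: entity.attributes.filter((attr) => !attr.optional).map((attr) => attr.name),
  };
}

const event: EntityType = {
  name: "event",
  description: "A specific incident or occurrence",
  attributes: [
    { name: "date", type: "date" },
    { name: "location", type: "string" },
  ],
};

const location: EntityType = {
  name: "location",
  description: "Geographic place",
  attributes: [{ name: "coordinates", type: "geo_point", optional: true }],
};

console.log(JSON.stringify(entitySchema(event), null, 2));
```

Deriving the schema from the config means a new entity type or attribute changes the extraction contract without touching code, which is the point of keeping domain logic in one file.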
| Decision | Details |
|---|---|
| Single PostgreSQL | pgvector + tsvector + PostGIS + recursive CTEs + job queue in one instance: no graph DB, no Redis, no Pub/Sub |
| Content in GCS | PDFs, layout JSON, page images, story Markdown in Cloud Storage. PostgreSQL holds references + search index only. |
| Service Abstraction | All GCP services behind interfaces. Dev mode uses fixtures: zero API calls, zero cost. |
| CLI-first | Every capability is a CLI command. The API is a job producer, not a direct executor. |
| PostgreSQL is truth | Pipeline state, job queue, config tracking. Firestore is observability-only (UI monitoring). |
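The service abstraction row can be illustrated with a small sketch: one interface, a fixture-backed implementation for dev mode, and a factory that picks between them. All names here (`EmbeddingService`, `FixtureEmbeddingService`, `VertexEmbeddingService`, `createEmbeddingService`) are hypothetical, and the production class is deliberately left unimplemented rather than guessing at Mulder's actual Vertex AI calls.

```typescript
// Hypothetical interface that the pipeline would depend on instead of a GCP client.
interface EmbeddingService {
  embed(texts: string[]): Promise<number[][]>;
}

// Dev-mode stand-in: deterministic pseudo-vectors, no network calls, no cost.
class FixtureEmbeddingService implements EmbeddingService {
  private readonly dimensions: number;

  constructor(dimensions = 768) {
    this.dimensions = dimensions;
  }

  async embed(texts: string[]): Promise<number[][]> {
    return texts.map((text) => {
      const vector = new Array<number>(this.dimensions).fill(0);
      // Spread character codes over the vector so equal inputs give equal outputs.
      for (let i = 0; i < text.length; i++) {
        const slot = i % this.dimensions;
        vector[slot] = (vector[slot] ?? 0) + text.charCodeAt(i) / 65535;
      }
      return vector;
    });
  }
}

// Placeholder for the real implementation; the actual API calls are out of scope here.
class VertexEmbeddingService implements EmbeddingService {
  async embed(): Promise<number[][]> {
    throw new Error("Not implemented in this sketch: would call Vertex AI.");
  }
}

function createEmbeddingService(mode: "dev" | "production"): EmbeddingService {
  return mode === "dev" ? new FixtureEmbeddingService() : new VertexEmbeddingService();
}

createEmbeddingService("dev")
  .embed(["The truth is in the documents."])
  .then(([vector]) => console.log(vector?.length)); // 768
```

Callers only see `EmbeddingService`, so switching between fixtures and the real service is a configuration choice rather than a code change.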
Baseline cost: ~30-40 EUR/mo for a small Cloud SQL instance. Scales with Gemini API usage.
| Layer | Technology |
|---|---|
| Language | TypeScript (ESM, strict mode) |
| Monorepo | pnpm + Turborepo |
| Infrastructure | Terraform (modular) |
| OCR | Document AI Layout Parser |
| LLM | Gemini 2.5 Flash (Vertex AI) |
| Embeddings | text-embedding-004 (768-dim Matryoshka) |
| Database | Cloud SQL PostgreSQL |
| Search | pgvector (HNSW) + tsvector (BM25) + recursive CTEs |
| Geospatial | PostGIS |
| CLI | Commander.js |
| Testing | Vitest |
Mulder's v1.0 MVP (M4) is complete — the full pipeline from ingest through hybrid retrieval is operational. PDFs go in, a knowledge graph comes out, and natural-language queries return ranked passages with LLM re-ranking. The functional spec, implementation roadmap, and config schema are finalized.
See the roadmap for all 14 milestones from foundation to autonomous research agent.
Contributions, feedback, and ideas are welcome. Open an issue or start a discussion.
