mulkatz/mulder

Mulder

Config-driven Document Intelligence Platform on GCP
Turn document collections into searchable knowledge graphs — defined by one config file, deployed by one command.

The truth is in the documents.

Live Demo · Functional Spec · Roadmap · Example Config


Development Progress: 55 / 125 steps

M1  Foundation       ██████████████████████████████ 11/11 ✓
M2  Ingest+Extract   ██████████████████████████████  9/9  ✓
M3  Segment+Enrich   ██████████████████████████████ 10/10 ✓
QA Gate: Pre-Search  ██████████████████████████████  6/6  ✓
M4  Search (v1.0)    ██████████████████████████████ 11/11 ✓
QA Gate: Post-MVP    ██████████████████████████████  7/7  ✓
M5  Curation         ██████░░░░░░░░░░░░░░░░░░░░░░░  1/5
M6  Intelligence     ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░  0/7
M7  API+Workers      ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░  0/9
M8  Operations       ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░  0/6
M9  Multi-Format     ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░  0/13
M10 Provenance       ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░  0/9
M11 Trust Layer      ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░  0/5
M12 Discovery        ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░  0/4
M13 Observability    ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░  0/5
M14 Research Agent   ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░  0/8

What it does

Mulder transforms unstructured document collections — PDFs with complex layouts like magazines, newspapers, government correspondence — into structured, searchable knowledge.

You define your domain ontology in a single mulder.config.yaml. The pipeline adapts: extraction, entity resolution, retrieval, and analysis all derive from that one config file. No custom code per domain.

mulder.config.yaml  →  terraform apply  →  mulder pipeline run ./pdfs/  →  mulder query "..."

Capabilities

| # | Capability | What it does |
|---|---|---|
| 1 | Layout Extraction | Document AI + Gemini Vision fallback for magazines, newspapers, multi-column layouts |
| 2 | Domain Ontology | One YAML defines entities, relationships, extraction rules. Gemini structured output with auto-generated JSON Schema. |
| 3 | Taxonomy | Auto-bootstrapped after ~25 docs, incremental growth, human-in-the-loop curation, cross-lingual |
| 4 | Hybrid Retrieval | Vector (pgvector) + BM25 (tsvector) + graph traversal (recursive CTEs), fused via RRF + LLM re-ranking |
| 5 | Web Grounding | Gemini verifies entities against live web data: coordinates, bios, org descriptions |
| 6 | Spatio-Temporal | PostGIS proximity queries, temporal clustering, pattern detection across time and space |
| 7 | Evidence Scoring | Corroboration scores, two-phase contradiction detection, source reliability (PageRank), evidence chains |
| 8 | Cross-Lingual Resolution | 3-tier entity resolution (attribute match, embedding similarity, LLM-assisted) across 100+ languages |
| 9 | Deduplication | MinHash/SimHash near-duplicate detection, dedup-aware corroboration scoring |
| 10 | Schema Evolution | Config-hash tracking per document per step, selective reprocessing after config changes |
| 11 | Visual Intelligence (v3.0 / Phase 2) | Image extraction, Gemini analysis, image embeddings, map/diagram data extraction |
| 12 | Pattern Discovery (v3.0 / Phase 2) | Cluster anomalies, temporal spikes, subgraph similarity, proactive insights |
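
The Hybrid Retrieval capability fuses several ranked result lists via RRF (Reciprocal Rank Fusion). The sketch below shows the general technique only; it is not taken from the Mulder codebase. The function name `fuseWithRrf`, the passage IDs, and the constant `k = 60` (a commonly used default) are assumptions for illustration.

```typescript
// Reciprocal Rank Fusion: score(d) = sum over result lists of 1 / (k + rank(d)).
// Hypothetical helper, not Mulder's actual implementation.
type RankedList = string[]; // passage IDs, best match first

function fuseWithRrf(
  lists: RankedList[],
  k = 60, // assumed smoothing constant
): Array<{ id: string; score: number }> {
  const scores = new Map<string, number>();
  for (const list of lists) {
    list.forEach((id, index) => {
      const rank = index + 1; // ranks are 1-based
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  // Highest fused score first.
  return [...scores.entries()]
    .map(([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score);
}

// Illustrative inputs standing in for vector, BM25, and graph results.
const vectorHits: RankedList = ["p3", "p1", "p7"];
const bm25Hits: RankedList = ["p1", "p9", "p3"];
const graphHits: RankedList = ["p1", "p4"];

const fused = fuseWithRrf([vectorHits, bm25Hits, graphHits]);
console.log(fused.map((entry) => entry.id));
```

Here `p1` ranks first because it appears near the top of all three lists, which is the behavior RRF is designed to reward. An LLM re-ranking pass, as described in the table, would then operate on the top of this fused list.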

Pipeline

          PDF
           │
     ┌─────▼─────┐
     │   Ingest  │  Upload to Cloud Storage, pre-flight validation
     └─────┬─────┘
           │
     ┌─────▼─────┐
     │  Extract  │  Document AI + Gemini Vision fallback → layout JSON + page images → GCS
     └─────┬─────┘
           │
     ┌─────▼─────┐
     │  Segment  │  Gemini identifies stories from page images → Markdown + metadata → GCS
     └─────┬─────┘
           │
     ┌─────▼─────┐
     │   Enrich  │  Entity extraction, taxonomy normalization, cross-lingual resolution
     └─────┬─────┘
           │
     ┌─────▼─────┐
     │   Ground  │  Web enrichment via Gemini Search — coordinates, bios, verification
     └─────┬─────┘
           │
     ┌─────▼─────┐
     │   Embed   │  Semantic chunking + text-embedding-004 (768-dim) → pgvector + BM25
     └─────┬─────┘
           │
     ┌─────▼─────┐
     │   Graph   │  Deduplication, corroboration scoring, contradiction flagging
     └─────┬─────┘
           │
     ┌─────▼─────┐
     │  Analyze  │  Contradiction resolution, PageRank reliability, evidence chains
     └─────┬─────┘
           │
       Knowledge
         Graph

Every step is idempotent, independently runnable, and CLI-accessible. Content artifacts live in GCS, search index in PostgreSQL.
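
One way to picture idempotent steps combined with the config-hash tracking listed under Schema Evolution: record a hash of the relevant config per document per step, and skip work when the hash is unchanged. The following is a minimal sketch under stated assumptions. `StepLedger`, `runStep`, and the in-memory `Map` are hypothetical stand-ins for whatever PostgreSQL-backed state Mulder actually keeps, and the FNV-1a hash is used only to keep the example dependency-free (a real system would more likely use a cryptographic hash such as SHA-256).

```typescript
// Hypothetical sketch of config-hash-based skipping; all names are illustrative.
type StepName =
  | "ingest" | "extract" | "segment" | "enrich"
  | "ground" | "embed" | "graph" | "analyze";

// Stable serialization: object keys are sorted so key order does not affect the hash.
function stableStringify(value: unknown): string {
  if (Array.isArray(value)) return `[${value.map(stableStringify).join(",")}]`;
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const parts = Object.keys(obj)
      .sort()
      .map((key) => `${JSON.stringify(key)}:${stableStringify(obj[key])}`);
    return `{${parts.join(",")}}`;
  }
  return JSON.stringify(value) ?? "null";
}

// 32-bit FNV-1a, a non-cryptographic stand-in for a real content hash.
function fnv1a(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return (hash >>> 0).toString(16).padStart(8, "0");
}

class StepLedger {
  // "docId:step" -> config hash of the last completed run (assumed in-memory store).
  private readonly done = new Map<string, string>();

  // Runs `work` only if this document has not completed this step under the same config.
  runStep(docId: string, step: StepName, stepConfig: unknown, work: () => void): "ran" | "skipped" {
    const key = `${docId}:${step}`;
    const hash = fnv1a(stableStringify(stepConfig));
    if (this.done.get(key) === hash) return "skipped";
    work();
    this.done.set(key, hash);
    return "ran";
  }
}

const ledger = new StepLedger();
let runs = 0;
const work = () => { runs += 1; };

const first = ledger.runStep("doc-1", "enrich", { entity_types: ["person"] }, work);
const second = ledger.runStep("doc-1", "enrich", { entity_types: ["person"] }, work);
const third = ledger.runStep("doc-1", "enrich", { entity_types: ["person", "event"] }, work);
console.log(first, second, third, runs);
```

Re-running with an identical config is a no-op, while a changed config triggers reprocessing for that step only, which matches the "selective reprocessing after config changes" behavior described in the capabilities table.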

Configuration

All domain logic lives in mulder.config.yaml. Define your domain, the pipeline adapts:

project:
  name: investigative-journalism

ontology:
  entity_types:
    - name: person
      description: Individual mentioned in documents
      attributes:
        - { name: role, type: string }
        - { name: affiliation, type: string }
    - name: event
      description: A specific incident or occurrence
      attributes:
        - { name: date, type: date }
        - { name: location, type: string }
    - name: location
      description: Geographic place
      attributes:
        - { name: coordinates, type: geo_point, optional: true }

  relationships:
    - { name: involved_in, from: person, to: event }
    - { name: occurred_at, from: event, to: location }

Everything beyond project and ontology has sensible defaults. See mulder.config.example.yaml for the full reference.
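
The capabilities table mentions Gemini structured output with an auto-generated JSON Schema. As a rough illustration of how an ontology entry could map to such a schema, consider the sketch below. Everything in it is an assumption: the type names, the `entitySchema` function, the implicit `name` property, the rule that non-optional attributes become required, and the `lat`/`lon` shape chosen for `geo_point`. Mulder's actual generator may differ on all of these points.

```typescript
// Hypothetical sketch: derive a JSON Schema for one entity type from the ontology config.
interface AttributeDef {
  name: string;
  type: "string" | "date" | "geo_point";
  optional?: boolean;
}

interface EntityTypeDef {
  name: string;
  description: string;
  attributes: AttributeDef[];
}

type JsonSchema = Record<string, unknown>;

// Assumed mapping from config attribute types to JSON Schema fragments.
function attributeSchema(type: AttributeDef["type"]): JsonSchema {
  switch (type) {
    case "string":
      return { type: "string" };
    case "date":
      return { type: "string", format: "date" };
    case "geo_point":
      return {
        type: "object",
        properties: { lat: { type: "number" }, lon: { type: "number" } },
        required: ["lat", "lon"],
      };
  }
}

function entitySchema(entity: EntityTypeDef): JsonSchema {
  // Assumption: every extracted entity carries a display name.
  const properties: Record<string, JsonSchema> = { name: { type: "string" } };
  const required: string[] = ["name"];
  for (const attr of entity.attributes) {
    properties[attr.name] = attributeSchema(attr.type);
    if (!attr.optional) required.push(attr.name); // assumption: non-optional means required
  }
  return {
    title: entity.name,
    description: entity.description,
    type: "object",
    properties,
    required,
  };
}

// The `location` entity type from the example config above.
const locationType: EntityTypeDef = {
  name: "location",
  description: "Geographic place",
  attributes: [{ name: "coordinates", type: "geo_point", optional: true }],
};

const locationSchema = entitySchema(locationType);
console.log(JSON.stringify(locationSchema, null, 2));
```

Because `coordinates` is marked `optional: true`, only `name` ends up in the `required` list for this entity type.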

Architecture

Single PostgreSQL: pgvector + tsvector + PostGIS + recursive CTEs + job queue. One instance, no graph DB, no Redis, no Pub/Sub.
Content in GCS: PDFs, layout JSON, page images, story Markdown in Cloud Storage. PostgreSQL holds references + search index only.
Service Abstraction: All GCP services behind interfaces. Dev mode uses fixtures, with zero API calls and zero cost.
CLI-first: Every capability is a CLI command. The API is a job producer, not a direct executor.
PostgreSQL is truth: Pipeline state, job queue, config tracking. Firestore is observability-only (UI monitoring).

Baseline cost: ~30-40 EUR/mo for a small Cloud SQL instance. Scales with Gemini API usage.
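
The Service Abstraction principle above (GCP services behind interfaces, fixtures in dev mode) can be sketched roughly as follows. The interface and class names are hypothetical and are not Mulder's actual API. The fixture vectors are arbitrary deterministic numbers, not real embeddings, and the GCP-backed implementation is deliberately omitted.

```typescript
// Hypothetical interface; the real Mulder service interfaces may look different.
interface EmbeddingService {
  embed(texts: string[]): Promise<number[][]>;
}

// Dev-mode implementation: deterministic fixture vectors, no network calls.
class FixtureEmbeddingService implements EmbeddingService {
  private readonly dimensions: number;

  constructor(dimensions = 768) {
    this.dimensions = dimensions;
  }

  async embed(texts: string[]): Promise<number[][]> {
    return texts.map((text) => {
      const vector = new Array<number>(this.dimensions).fill(0);
      for (let i = 0; i < text.length; i++) {
        const slot = i % this.dimensions;
        vector[slot] = (vector[slot] ?? 0) + text.charCodeAt(i) / 1000;
      }
      return vector;
    });
  }
}

// Factory selecting the implementation; the "gcp" branch is left out of this sketch.
function createEmbeddingService(mode: "dev" | "gcp"): EmbeddingService {
  if (mode === "dev") return new FixtureEmbeddingService();
  throw new Error("GCP-backed implementation omitted from this sketch");
}

// Pipeline code depends only on the interface, so implementations can be swapped.
async function embedChunks(service: EmbeddingService, chunks: string[]): Promise<number> {
  const vectors = await service.embed(chunks);
  return vectors.length;
}

const devService = createEmbeddingService("dev");
embedChunks(devService, ["first chunk", "second chunk"]).then((count) => {
  console.log(`embedded ${count} chunks without calling any API`);
});
```

Keeping the pipeline code dependent on the interface alone is what lets a dev run proceed with no API calls, as the Architecture section describes.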

Tech Stack

Language: TypeScript (ESM, strict mode)
Monorepo: pnpm + Turborepo
Infrastructure: Terraform (modular)
OCR: Document AI Layout Parser
LLM: Gemini 2.5 Flash (Vertex AI)
Embeddings: text-embedding-004 (768-dim Matryoshka)
Database: Cloud SQL PostgreSQL
Search: pgvector (HNSW) + tsvector (BM25) + recursive CTEs
Geospatial: PostGIS
CLI: Commander.js
Testing: Vitest

Status

Mulder's v1.0 MVP (M4) is complete — the full pipeline from ingest through hybrid retrieval is operational. PDFs go in, a knowledge graph comes out, and natural-language queries return ranked passages with LLM re-ranking. The functional spec, implementation roadmap, and config schema are finalized.

See the roadmap for all 14 milestones from foundation to autonomous research agent.

Contributing

Contributions, feedback, and ideas are welcome. Open an issue or start a discussion.

License

Apache 2.0
