ARCHILLES

Intelligent search for your Calibre library

A privacy-first RAG system that brings semantic search to your personal research library. Works with Calibre, Zotero, Obsidian vaults, and plain folders. Built for scholars, researchers, and anyone with a serious book collection.

License: MIT · Python 3.11–3.12 · MCP Compatible



What is Archilles?

If you're a researcher, you know this problem: You've spent years building a carefully curated library in Calibre—hundreds or thousands of books, annotated and tagged. But when you need to find that specific argument about medieval trade routes, or compare how three different authors approach consciousness, you're stuck with keyword search. You know the passage exists. You just can't find it.

Archilles solves this.

It's a semantic search system built specifically for Calibre libraries. Instead of matching keywords, it understands meaning. Ask it "discussions of political legitimacy in early modern Europe" and it finds relevant passages—even if they never use those exact words.

Everything runs locally on your machine. Your library, your annotations, your research—they stay private. No cloud services, no data uploads, no subscriptions.

Built on solid foundations

  • Retrieval-Augmented Generation (RAG): Combines semantic embeddings with keyword search for best-of-both-worlds accuracy
  • Model Context Protocol (MCP): Native integration with Claude and other AI assistants
  • Calibre Integration: Works seamlessly with your existing library structure
  • Local-First: LanceDB for vector storage, all processing happens on your hardware

Key Features

🧠 Semantic Search

Find books by meaning, not just keywords. Ask natural questions and get relevant passages from across your entire library.

🔒 Privacy-First

All data stays on your machine. No cloud uploads, no telemetry, no tracking. Your research library remains private.

🔗 MCP-Native

Seamless integration with Claude Desktop and other MCP-compatible tools. Your AI assistant can search your library directly.

Works with Claude Desktop (stdio), and with ChatGPT Desktop, OpenAI Codex, Cursor, and any other MCP client that connects via URL (SSE transport). See MCP Integration Guide →

📚 Calibre-Integrated

Reads directly from your Calibre library structure. Extracts metadata, tags, comments, annotations, and custom fields automatically.

💬 Comments & Annotations

Searches beyond book text. Your Calibre comments, highlights, and notes are all indexed and searchable.

🏷️ Tag-Aware

Filter by Calibre tags, combine searches across custom fields, leverage all the organization you've already done.

🌍 Multilingual

Built-in language detection for 75+ languages. Search in German, English, Latin, Greek, French—or all at once.

Hybrid Search

Combines semantic understanding (BGE-M3 embeddings) with keyword precision (BM25). Get the best of both approaches.

🎯 Cross-Encoder Reranking (optional)

Enable a second-stage reranker (BAAI/bge-reranker-v2-m3) that scores each query-document pair to improve relevance ranking. Falls back gracefully if your system has limited memory.

🔖 Research Interest Boosting

Register project-specific keywords once and they automatically receive a score boost in every subsequent search—without re-indexing. Switch focus between projects in seconds. Managed via the set_research_interests MCP tool.
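A minimal sketch of how an additive query-time boost could work. The `Hit` class, the `apply_interest_boost` helper, the substring matching rule, and the boost value of 0.05 are all illustrative assumptions, not Archilles's actual implementation.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    text: str
    score: float


def apply_interest_boost(hits, interests, boost=0.05):
    """Add a fixed bonus per matching interest keyword, then re-sort.

    Hypothetical helper: the boost size and matching rule are assumptions.
    """
    keywords = [k.lower() for k in interests]
    boosted = []
    for hit in hits:
        lowered = hit.text.lower()
        matches = sum(1 for k in keywords if k in lowered)
        boosted.append(Hit(hit.text, hit.score + boost * matches))
    return sorted(boosted, key=lambda h: h.score, reverse=True)


hits = [
    Hit("A general survey of Roman administration.", 0.62),
    Hit("The cursus honorum of late antique senators.", 0.60),
]
ranked = apply_interest_boost(hits, ["cursus honorum", "prosopography"])
for hit in ranked:
    print(f"{hit.score:.2f}  {hit.text}")
```

Because the bonus is added to scores at query time, changing the keyword list only changes the ranking; the stored index is never touched.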

📤 Academic Bibliography Export

Export your library (or any filtered subset) as BibTeX, RIS, EndNote, JSON, or CSV. Filter by author, tag, or publication year. One tool call from Claude Desktop is all it takes.


Why Archilles?

| Archilles | Cloud RAG Services | Calibre Search | Other MCP Servers |
|---|---|---|---|
| Privacy-first, local processing | Your data uploaded to cloud | Basic keyword matching | Often single-purpose |
| Semantic + keyword hybrid | Usually semantic only | No semantic understanding | Varying capabilities |
| Calibre-native integration | Generic document handling | Built-in but limited | May not support Calibre |
| One-time setup, no subscriptions | Monthly fees, usage limits | Free (included) | Varies widely |
| Full control over your data | Terms of service apply | Your data, basic search | Depends on service |

Archilles gives you the semantic search capabilities of modern RAG systems while keeping everything under your control. If you've invested years in building and organizing your Calibre library, Archilles makes that investment far more useful.


Quick Start

Prerequisites

  • Python 3.11 or 3.12 (3.13+ not yet tested)
  • Calibre with your book library
  • (Optional) An MCP-compatible client: Claude Desktop (stdio), or ChatGPT Desktop / Codex / Cursor via SSE

Installation

# Clone the repository
git clone https://github.com/kasssandr/archilles.git
cd archilles

# Install dependencies
pip install -r requirements.txt

# Set your library path
# Windows PowerShell:
$env:ARCHILLES_LIBRARY_PATH = "D:\Your-Library"
# Linux/Mac:
export ARCHILLES_LIBRARY_PATH="/path/to/your/Library"
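
Before indexing, it can help to confirm that the variable is set and points at a real Calibre library. The sketch below is a hypothetical stand-alone check, not part of Archilles; `resolve_library_path` is an invented name. It relies only on the fact that a Calibre library keeps its database in a `metadata.db` file at the top level, so it does not apply to Zotero, Obsidian, or plain-folder sources.

```python
import os
from pathlib import Path


def resolve_library_path(env=None):
    """Return the library path from ARCHILLES_LIBRARY_PATH, or raise.

    Hypothetical helper for illustration; not part of Archilles.
    """
    env = os.environ if env is None else env
    raw = env.get("ARCHILLES_LIBRARY_PATH")
    if not raw:
        raise RuntimeError("ARCHILLES_LIBRARY_PATH is not set")
    path = Path(raw).expanduser()
    if not path.is_dir():
        raise RuntimeError(f"Library path does not exist: {path}")
    # A Calibre library keeps its database in metadata.db at the top level.
    if not (path / "metadata.db").is_file():
        raise RuntimeError(f"No metadata.db found in {path}")
    return path


if __name__ == "__main__":
    try:
        print(resolve_library_path())
    except RuntimeError as error:
        print(f"Configuration problem: {error}")
```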

Index Your First Book

# Index a single book
python scripts/rag_demo.py index "/path/to/Calibre Library/Author/Book/book.pdf"

# Check your index
python scripts/rag_demo.py stats

Batch Index by Tag

# Preview what would be indexed (dry run)
python scripts/batch_index.py --tag "Your-Tag" --dry-run

# Index all books with a specific Calibre tag
python scripts/batch_index.py --tag "History"

# Index with progress logging
python scripts/batch_index.py --tag "History" --log indexing.json

# Resume interrupted indexing (skip already indexed books)
python scripts/batch_index.py --tag "History" --skip-existing

Search Your Library

# Hybrid search (recommended - combines semantic + keyword)
python scripts/rag_demo.py query "trade networks in medieval Europe"

# Filter by language
python scripts/rag_demo.py query "Rex" --language la

# Filter by tags
python scripts/rag_demo.py query "political theory" --tag-filter Philosophy History

# Export results to Markdown (for Joplin/Obsidian)
python scripts/rag_demo.py query "consciousness" --export results.md

Claude Desktop Integration

Add to your Claude Desktop config (%APPDATA%\Claude\claude_desktop_config.json on Windows):

{
  "mcpServers": {
    "archilles": {
      "command": "python",
      "args": ["C:/Users/YOU/archilles/mcp_server.py"],
      "env": {
        "ARCHILLES_LIBRARY_PATH": "D:/Your-Library"
      }
    }
  }
}
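
If the config file already lists other MCP servers, the `archilles` entry has to be merged in rather than pasted over them. A minimal sketch of that merge, using only the keys shown in the JSON above; the helper name `add_archilles_server` and the `other-tool` entry are made up for illustration.

```python
import json


def add_archilles_server(config, server_script, library_path):
    """Return a copy of a Claude Desktop config with an "archilles" entry added."""
    merged = dict(config)
    servers = dict(merged.get("mcpServers", {}))
    servers["archilles"] = {
        "command": "python",
        "args": [server_script],
        "env": {"ARCHILLES_LIBRARY_PATH": library_path},
    }
    merged["mcpServers"] = servers
    return merged


# "other-tool" stands in for whatever servers are already configured.
existing = {"mcpServers": {"other-tool": {"command": "node", "args": ["server.js"]}}}
updated = add_archilles_server(
    existing, "C:/Users/YOU/archilles/mcp_server.py", "D:/Your-Library"
)
print(json.dumps(updated, indent=2))
```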

Then in Claude Desktop, you can use natural language — all 12 MCP tools are available:

  • "Search my books for discussions of political legitimacy"
  • "Find annotations about consciousness"
  • "What did I highlight about medieval trade?"
  • "List all books by Hannah Arendt in my library"
  • "Set my research interests to: prosopography, late antique senators, cursus honorum"
  • "Export my Philosophy books as BibTeX"

📖 Full Installation Guide →

Keeping Your Index in Sync

As you add books, edit tags, or highlight passages in Calibre, the LanceDB index drifts out of date. The built-in Watchdog closes that gap automatically:

  • Claude users — create a Routine that calls the watchdog_scan MCP tool on whatever schedule fits (daily / weekly / monthly). No shell, no scheduler.
  • Everyone else — run python scripts/watchdog.py via Windows Task Scheduler, cron, or launchd. After the first scan has seeded its annotation-signature cache, repeat scans take seconds on a few-thousand-book library.

Daily or weekly is usually enough; hourly is overkill. See MCP Integration Guide → Keeping Your Index in Sync for concrete Task Scheduler / cron / launchd examples.
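
This README does not spell out how the annotation-signature cache works. One plausible reading, sketched here purely as an assumption, is that each book's annotations are hashed and only books whose hash has changed since the last scan are re-indexed. The function names and data shapes below are hypothetical.

```python
import hashlib
import json


def annotation_signature(annotations):
    """Hash a book's annotations into an order-independent signature."""
    canonical = json.dumps(sorted(annotations), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def books_needing_reindex(library, cache):
    """Return IDs whose annotation signature differs from the cached one.

    `library` maps book ID -> list of annotation strings; `cache` maps
    book ID -> last seen signature and is updated in place.
    """
    changed = []
    for book_id, annotations in library.items():
        signature = annotation_signature(annotations)
        if cache.get(book_id) != signature:
            changed.append(book_id)
            cache[book_id] = signature
    return changed


cache = {}
library = {1: ["note on trade"], 2: ["highlight A", "highlight B"]}
first = books_needing_reindex(library, cache)   # first scan seeds the cache
library[2].append("highlight C")
second = books_needing_reindex(library, cache)  # only book 2 has changed
print(first, second)
```

Under this reading, the first scan is the expensive one because every book is new to the cache; later scans only pay for hashing plus whatever actually changed.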


How It Works

Archilles builds a semantic index of your Calibre library that enables intelligent search:

┌─────────────┐
│   Calibre   │ ← Your existing library (books, metadata, tags, comments)
└──────┬──────┘
       │
       ▼
┌─────────────┐
│  Extractors │ ← PyMuPDF (primary), pdfplumber, EPUB, MOBI, DJVU...
└──────┬──────┘
       │
       ▼
┌─────────────┐
│   LanceDB   │ ← BGE-M3 embeddings + BM25 full-text (hybrid search)
└──────┬──────┘
       │
       ▼
┌─────────────┐
│  Retriever  │ ← RRF fusion + optional cross-encoder reranking
└──────┬──────┘
       │
       ▼
┌─────────────┐
│   Service   │ ← ArchillesService: central facade for all consumers
└──┬───┬───┬──┘
   │   │   │
   ▼   ▼   ▼
 MCP  Web  CLI  ← Claude Desktop, Streamlit UI, command line
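
The Retriever stage in the diagram fuses the semantic and keyword rankings with Reciprocal Rank Fusion. As a stand-alone illustration of the technique (not Archilles's own code, which uses LanceDB's native hybrid search), each document scores the sum of 1 / (k + rank) over the lists it appears in. The constant k = 60 is the conventional default from the RRF literature; the value Archilles uses is not stated here.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with ranks starting at 1.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


semantic = ["chunk_a", "chunk_b", "chunk_c"]  # e.g. from vector search
keyword = ["chunk_c", "chunk_a", "chunk_d"]   # e.g. from BM25
fused = reciprocal_rank_fusion([semantic, keyword])
print(fused)
```

Documents that rank well in both lists (`chunk_a`, `chunk_c`) rise above those that appear in only one, without the two score scales ever needing to be comparable.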

What Gets Indexed

  • Book text: Full-text extraction from 30+ formats (PDF via PyMuPDF, EPUB, MOBI, DJVU, etc.)
  • Calibre metadata: Title, author, publisher, ISBN, language
  • Tags: Your Calibre tags become searchable
  • Comments: Calibre's comments field (HTML cleaned automatically)
  • Custom fields: Any custom Calibre fields you've defined (reading status, projects, ratings, etc.)
  • Annotations: Your Calibre highlights and notes (searchable via search_annotations)

Search Technology

  • BGE-M3 embeddings: State-of-the-art multilingual semantic understanding (1024 dimensions, GPU)
  • BM25 keyword search: Precision matching for exact terms (names, Latin phrases, technical terms)
  • Reciprocal Rank Fusion (RRF): Intelligently combines semantic and keyword results (stage 1)
  • Cross-encoder reranking (optional): BAAI/bge-reranker-v2-m3 rescores top candidates for more accurate ranking (stage 2, CPU)
  • Stop-word removal: Applied at both indexing and query time for EN, DE, FR, ES, IT, PT, NL, LA, RU, EL, HE, and AR, so common function words do not dilute keyword matches
  • Section filtering: Exclude bibliography, index, and front matter noise from results
  • Context expansion: Small-to-Big retrieval shows surrounding text for better understanding
  • Smart boosting: Calibre comments (1.2×) and tag matches (1.15×) get priority in results
  • Research interest boosting: Additive keyword boost at query time—no re-indexing required
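
The context expansion item above refers to Small-to-Big retrieval: search runs over small chunks for precision, and the result shown to the reader includes the neighbouring text. A minimal sketch of the idea, assuming chunks are stored in reading order; `expand_context` and its `window` parameter are hypothetical names, and how Archilles stores its surrounding text is not described here.

```python
def expand_context(chunks, hit_index, window=1):
    """Return the matched chunk plus `window` neighbours on each side."""
    start = max(0, hit_index - window)
    end = min(len(chunks), hit_index + window + 1)
    return " ".join(chunks[start:end])


chunks = [
    "Merchants gathered at the fairs of Champagne.",
    "Credit instruments reduced the need to carry coin.",
    "Italian bankers opened branches in Bruges.",
    "Wool exports shifted towards Flanders.",
]
# Suppose the search matched the second chunk (index 1).
print(expand_context(chunks, hit_index=1))
```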

Configuration

Archilles reads optional configuration from .archilles/config.json inside your Calibre library:

{
  "enable_reranking": true,
  "reranker_device": "cpu"
}

| Option | Default | Description |
|---|---|---|
| enable_reranking | false | Enable cross-encoder reranking (more accurate but slower; downloads ~560MB model on first use) |
| reranker_device | "cpu" | Device for reranker inference ("cpu" or "cuda"). CPU recommended when GPU runs BGE-M3 |
| rag_db_path | .archilles/rag_db | Custom path for the vector database |
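
Since the file is optional, a reader of this config has to fall back to the defaults listed above when it is missing or incomplete. A minimal sketch of that behaviour; `load_config` and `DEFAULTS` are illustrative names, not Archilles's actual loader.

```python
import json
from pathlib import Path

# Defaults as documented in the options table.
DEFAULTS = {
    "enable_reranking": False,
    "reranker_device": "cpu",
    "rag_db_path": ".archilles/rag_db",
}


def load_config(library_path):
    """Merge .archilles/config.json (if present) over the documented defaults."""
    config = dict(DEFAULTS)
    config_file = Path(library_path) / ".archilles" / "config.json"
    if config_file.is_file():
        with config_file.open(encoding="utf-8") as handle:
            config.update(json.load(handle))
    return config


print(load_config("."))
```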

🏗️ Architecture Details →


Use Cases

📜 Historian

"Find all discussions of trade routes between Mediterranean and Northern Europe before 1500"

Archilles searches across your entire collection—Latin primary sources, German monographs, English translations—and surfaces relevant passages based on concepts, not just keywords.

📖 Literary Scholar

"Trace the motif of unreliable narrators across these 50 twentieth-century novels"

Semantic search finds passages that demonstrate unreliable narration, even when the texts never use that term. Your annotations and comments help prioritize the most relevant examples.

🤔 Philosopher

"Compare views on the hard problem of consciousness across Chalmers, Dennett, and Nagel"

Hybrid search combines precise name matching with semantic understanding of philosophical concepts. Your Calibre tags help filter to relevant texts.

🎵 Musicologist

"Find theoretical discussions of modal harmony in Renaissance treatises"

Multilingual search works across Latin treatises, Italian commentary, and modern scholarship. Technical terms get exact matching while broader concepts use semantic search.

⚖️ Legal Researcher

"Locate all references to customary law in medieval court records"

Search through your collection of primary sources and secondary literature simultaneously. Custom Calibre fields (like "source_type" or "jurisdiction") help organize results.


Product Roadmap

Current Release: v0.9 Beta (March 2026)

Core functionality complete:

  • Full-text indexing (30+ formats, PyMuPDF primary for PDFs)
  • Semantic + keyword hybrid search (LanceDB native)
  • Two-stage retrieval: RRF fusion + optional cross-encoder reranking
  • Calibre metadata integration (tags, comments, custom fields, annotation indexing)
  • MCP server — 12 tools for Claude Desktop (search, metadata, annotations, bibliography, utilities)
  • Multi-language support (75+ languages, stop-word removal for 12 languages)
  • BGE-M3 embeddings (multilingual, 1024 dimensions)
  • OCR support for scanned PDFs (Tesseract)
  • Hardware-adaptive indexing profiles (minimal/balanced/maximal; CUDA, Apple Silicon MPS, CPU)
  • Streamlit Web UI (experimental)
  • Section-type filtering (exclude bibliography/index noise)
  • Context expansion (Small-to-Big retrieval with window_text)
  • Service layer architecture (decoupled MCP/Web-UI/CLI)
  • Page labels (printed page numbers) for citation accuracy
  • Research interest boosting (set_research_interests)
  • list_books_by_author — direct Calibre metadata query, reliable for articles and short texts
  • Bibliography export: BibTeX, RIS, EndNote, JSON, CSV
  • Crash-safe batch indexing with progress.db checkpoint system and backup rotation
  • Duplicate detection and calibre:// URI links for direct book access
  • Source adapters: Calibre, Zotero, Obsidian vaults, plain folders — not just Calibre anymore
  • Structure-aware PDF chunking: chapter/section metadata from TOC, running footer removal
  • DialogueChunker: specialized chunking for chat/Q&A exports (ChatGPT, Gemini, Grok, NotebookLM)

Coming in v1.0

🚧 Planned improvements:

  • Incremental indexing (index new books without full re-index of the collection)
  • Docling-based Markdown extraction (structured output from complex academic PDFs)
  • VLM-based OCR (LightOnOCR-2, GOT-OCR 2.0)

Future Development

🔮 On the horizon:

  • Graph RAG (entity relationships, timeline views)
  • Special Editions (discipline-specific extensions)
  • Multi-library support

📅 Detailed Roadmap →


Special Editions (Future)

Archilles is being developed as a modular platform. The core (what you're using now) will always be free and open source.

Special Editions will extend Archilles with discipline-specific features for researchers who need them:

  • 📜 Historical Edition: Timeline visualization, prosopography, chronology-aware search
  • 📖 Literary Edition: Motif tracking, intertextual connections, narrative structure analysis
  • ⚖️ Legal Edition: Citation networks, precedent tracking, jurisdiction-aware search
  • 🎵 Musical Edition: Score analysis integration, theoretical terminology, composer networks

These editions are commercial add-ons to support ongoing development. The core will remain MIT licensed and fully functional.

🎯 Edition Details →


Community & Contributing

Get Help

Contribute

Archilles is open source (MIT License). Contributions are welcome!

Beta Testing

We're actively seeking beta testers from diverse research disciplines. If you have a substantial Calibre library (500+ books) and want to help shape Archilles, join our beta program.

Code of Conduct: We're committed to building a welcoming community. See CODE_OF_CONDUCT.md.


Documentation

  • 📖 Installation Guide – Detailed setup instructions
  • 📘 Usage Guide – All commands, MCP tools, and practical workflows
  • 🗂️ Feature Catalog – Complete reference of all implemented features
  • 🏗️ Architecture – Technical deep dive
  • 🔌 MCP Integration – Connect Archilles to Claude Desktop
  • ❓ FAQ – Frequently asked questions
  • 🔧 Troubleshooting – Common issues and solutions


Legal & Privacy

License

Archilles is released under the MIT License. Free to use, modify, and distribute.

Privacy Statement

Archilles is local-first software. We collect no telemetry, no analytics, no usage data. Your library stays on your machine.

User Responsibility

You are responsible for ensuring your use of Archilles complies with copyright law in your jurisdiction. Archilles is a tool for searching your own legally acquired library.


Acknowledgments

Archilles is built on the shoulders of giants:

  • Calibre by Kovid Goyal – The gold standard for e-book library management
  • LanceDB – High-performance vector database with native hybrid search
  • Model Context Protocol by Anthropic – Standardized AI assistant integration
  • BGE-M3 – State-of-the-art multilingual embeddings
  • Anthropic Claude – AI assistant that respects user privacy

Inspired by NotebookLM, Zotero, and decades of digital humanities research.


Contact & Links

  • 🌐 Website: archilles.org · archilles.de
  • 💻 GitHub: github.com/kasssandr/archilles
  • 💬 Discussions: GitHub Discussions
  • 📧 Contact: hello@archilles.org


Built for researchers, by a researcher.

Archilles: Because your library deserves better than keyword search.
