folded · folded · Mar 16, 2026 · Jan 8, 2026 · Jan 19, 2026 · Jan 19, 2026
diff --git a/README.md b/README.md
@@ -4,90 +4,73 @@
 
 ## Traceable Generative Markdown for PDFs
 
-Gemini OCR is a library designed to convert PDF documents into clean, semantic Markdown while maintaining precise traceability back to the source coordinates. It bridges the gap between the readability of Generative AI (Gemini, Document AI Chunking) and the grounded accuracy of traditional OCR (Google Document AI).
+`gemini-ocr` provides [anchorite](https://github.com/folded/anchorite) provider
+plugins that convert PDFs to traceable Markdown using Google Cloud APIs.
 
-## Key Features
-
-- **Generative Markdown**: Uses Google's Gemini Pro or Document AI Layout models to generate human-readable Markdown with proper structure (headers, tables, lists).
-- **Precision Traceability**: Aligns the generated Markdown text back to the original PDF coordinates using detailed OCR data from Google Document AI.
-- **Reverse-Alignment Algorithm**: Implements a robust "reverse-alignment" strategy that starts with the readable text and finds the corresponding bounding boxes, ensuring the Markdown is the ground truth for content.
-- **Confidence Metrics**: (New) Includes coverage metrics to quantify how much of the Markdown content is successfully backed by OCR data.
-- **Pagination Support**: Automatically handles PDF page splitting and merging logic.
-
-## Architecture
-
-The library processes documents in two parallel streams:
-
-1. **Semantic Stream**: The PDF is sent to a Generative AI model (e.g., Gemini 2.5 Flash) to produce a clean Markdown representation.
-2. **Positional Stream**: The PDF is sent to Google Document AI to extract raw bounding boxes and text segments.
-
-These two streams are then merged using a custom alignment engine (`seq_smith` + `bbox_alignment.py`) which:
-
-1. Normalizes both text sources.
-2. Identifies "anchor" comparisons for reliable alignment.
-3. Computes a global alignment using the anchors to constrain the search space.
-4. Identifies significant gaps or mismatches.
-5. Recursively re-aligns mismatched regions until a high-quality alignment is achieved.
-
-**Key Features:**
-
-- **Robust to Cleanliness Issues:** Handles extra headers/footers, watermarks, and noisy OCR artifacts.
-- **Scale-Invariant:** Recursion ensures even small missed sections in large documents are recovered.
+- **`GeminiMarkdownProvider`** — generates Markdown via the Gemini API
+- **`DocAIMarkdownProvider`** — generates Markdown via Document AI Layout
+- **`DocAIAnchorProvider`** — extracts bounding boxes via Document AI OCR
+- **`DoclingMarkdownProvider`** — generates Markdown via Docling (stub)
 
 ## Quick Start
 
 ```python
 import asyncio
 from pathlib import Path
-from gemini_ocr import gemini_ocr, settings
+
+import anchorite
+from gemini_ocr import DocAIAnchorProvider, GeminiMarkdownProvider
 
 async def main():
-    # Configure settings
-    ocr_settings = settings.Settings(
-        project="my-gcp-project",
-        location="us",
-        gcp_project_id="my-gcp-project",
-        layout_processor_id="projects/.../processors/...",
-        ocr_processor_id="projects/.../processors/...",
-        mode=settings.OcrMode.GEMINI,
+    markdown_provider = GeminiMarkdownProvider(
+        project_id="my-gcp-project",
+        location="us-central1",
+        model_name="gemini-2.5-flash",
+    )
+    anchor_provider = DocAIAnchorProvider(
+        project_id="my-gcp-project",
+        location="us-central1",
+        processor_id="projects/.../processors/...",
     )
 
-    file_path = Path("path/to/document.pdf")
+    chunks = anchorite.document.chunks(Path("document.pdf"))
+    result = await anchorite.process_document(
+        chunks, markdown_provider, anchor_provider, renumber=True
+    )
 
-    # Process the document
-    result = await gemini_ocr.process_document(ocr_settings, file_path)
+    print(result.markdown_content)
+    print(result.annotate())   # Markdown with inline <span data-bbox="..."> tags
 
-    # Access results
-    print(f"Coverage: {result.coverage_percent:.2%}")
+asyncio.run(main())
+```
 
-    # Get annotated HTML-compatible Markdown
-    annotated_md = result.annotate()
-    print(annotated_md[:500])  # View first 500 chars
+## Configuration via Environment Variables
+
+`from_env()` builds both providers from environment variables, which suits
+twelve-factor deployments:
+
+```python
+import asyncio
+from pathlib import Path
+
+import anchorite
+from gemini_ocr import from_env
 
-if __name__ == "__main__":
-    asyncio.run(main())
+async def main():
+    markdown_provider, anchor_provider = from_env()
+    chunks = anchorite.document.chunks(Path("document.pdf"))
+    result = await anchorite.process_document(chunks, markdown_provider, anchor_provider)
+
+asyncio.run(main())
 ```
 
-## Configuration
-
-The `gemini_ocr.settings.Settings` class controls the behavior:
-
-| Parameter                        | Type      | Description                                                      |
-| :------------------------------- | :-------- | :--------------------------------------------------------------- |
-| `project`                        | `str`     | GCP Project Name                                                 |
-| `location`                       | `str`     | GCP Location (e.g., `us`, `eu`)                                  |
-| `gcp_project_id`                 | `str`     | GCP Project ID (might be same as `project`)                      |
-| `layout_processor_id`            | `str`     | Document AI Processor ID for Layout (if using `DOCUMENTAI` mode) |
-| `ocr_processor_id`               | `str`     | Document AI Processor ID for OCR (required for bounding boxes)   |
-| `mode`                           | `OcrMode` | `GEMINI` (default), `DOCUMENTAI`, or `DOCLING`                   |
-| `gemini_model_name`              | `str`     | Gemini model to use (default: `gemini-2.5-flash`)                |
-| `alignment_uniqueness_threshold` | `float`   | Min score ratio for unique match (default: `0.5`)                |
-| `alignment_min_overlap`          | `float`   | Min overlap fraction for valid match (default: `0.9`)            |
-| `include_bboxes`                 | `bool`    | Whether to perform alignment (default: `True`)                   |
-| `markdown_page_batch_size`       | `int`     | Pages per batch for Markdown generation (default: `10`)          |
-| `ocr_page_batch_size`            | `int`     | Pages per batch for OCR (default: `10`)                          |
-| `num_jobs`                       | `int`     | Max concurrent jobs (default: `10`)                              |
-| `cache_dir`                      | `str`     | Directory to store API response cache (default: `.docai_cache`)  |
+| Variable                          | Description                                                      |
+| :-------------------------------- | :--------------------------------------------------------------- |
+| `GEMINI_OCR_PROJECT_ID`           | GCP project ID (required)                                        |
+| `GEMINI_OCR_LOCATION`             | GCP location (default: `us-central1`)                            |
+| `GEMINI_OCR_MODE`                 | `gemini` (default), `documentai`, or `docling`                   |
+| `GEMINI_OCR_GEMINI_MODEL_NAME`    | Gemini model name (required in `gemini` mode)                    |
+| `GEMINI_OCR_LAYOUT_PROCESSOR_ID`  | Document AI Layout processor ID (required in `documentai` mode)  |
+| `GEMINI_OCR_OCR_PROCESSOR_ID`     | Document AI OCR processor ID (enables bounding box extraction)   |
+| `GEMINI_OCR_DOCUMENTAI_LOCATION`  | Document AI endpoint location override                           |
+| `GEMINI_OCR_QUOTA_PROJECT_ID`     | Quota project override for Gemini API calls                      |
+| `GEMINI_OCR_GEMINI_PROMPT`        | Additional prompt appended to the default Gemini prompt          |
+| `GEMINI_OCR_CACHE_DIR`            | Directory for caching API responses                              |
+| `GEMINI_OCR_INCLUDE_BBOXES`       | Set to `false` to skip bounding box extraction (default: `true`) |
 
 ## License
 

diff --git a/docs/source/api.rst b/docs/source/api.rst
@@ -6,22 +6,22 @@ API Reference
    :undoc-members:
    :show-inheritance:
 
-.. automodule:: gemini_ocr.gemini_ocr
+.. automodule:: gemini_ocr.gemini
    :members:
    :undoc-members:
    :show-inheritance:
 
-.. automodule:: gemini_ocr.settings
+.. automodule:: gemini_ocr.docai_layout
    :members:
    :undoc-members:
    :show-inheritance:
 
-.. automodule:: gemini_ocr.document
+.. automodule:: gemini_ocr.docai_ocr
    :members:
    :undoc-members:
    :show-inheritance:
 
-.. automodule:: gemini_ocr.bbox_alignment
+.. automodule:: gemini_ocr.gemini_ocr
    :members:
    :undoc-members:
    :show-inheritance:
diff --git a/pyproject.toml b/pyproject.toml
@@ -4,7 +4,7 @@ build-backend = "hatchling.build"
 
 [project]
 name = "gemini-ocr"
-version = "0.4.0"
+version = "0.5.0"
 authors = [
     { name = "Tobias Sargeant", email = "tobias.sargeant@gmail.com" },
 ]
@@ -24,12 +24,13 @@ dependencies = [
     "google-genai",
     "google-cloud-documentai",
     "pymupdf",
-    "seq-smith>=0.5.1",
     "python-dotenv>=1.2.1",
     "fsspec",
     "gcsfs",
+    "anchorite==0.2.0",
 ]
 
+
 [dependency-groups]
 dev = [
     "pytest",
@@ -50,6 +51,7 @@ indent-width = 4
 select = ["A", "B", "C",  "E", "F", "G", "I", "N", "Q", "S", "W", "ANN", "ARG", "BLE", "COM", "DJ", "DTZ", "ERA", "EXE", "ICN", "ISC", "NPY", "PD", "PGH", "PIE", "PL", "PT", "PYI", "RET", "RSE", "RUF", "SIM", "SLF", "TCH", "TID", "UP", "YTT"]
 ignore = [
     "ANN101",  # missing-type-self
+    "COM812",  # conflicts with ruff format
     "PD011",   # pandas-use-of-dot-values (false positive)
 ]
 fixable = ["A", "B", "C", "D", "E", "F", "G", "I", "N", "Q", "S", "T", "W", "ANN", "ARG", "BLE", "COM", "DJ", "DTZ", "ERA", "EXE", "FBT", "ICN", "ISC", "NPY", "PD", "PGH", "PIE", "PL", "PT", "PYI", "RET", "RSE", "RUF", "SIM", "SLF", "TCH", "TID", "UP", "YTT"]

diff --git a/run_ocr.py b/run_ocr.py
@@ -6,11 +6,12 @@
 import sys
 import traceback
 
+import anchorite
 import dotenv
 import google.auth
 from google import genai
 
-from gemini_ocr import gemini_ocr, settings
+from gemini_ocr import DocAIAnchorProvider, DocAIMarkdownProvider, GeminiMarkdownProvider
 
 
 def _list_models(project: str | None, location: str, quota_project: str | None) -> None:
@@ -66,14 +67,18 @@ async def main() -> None:
     parser.add_argument(
         "--ocr-processor-id",
         default=os.environ.get("DOCUMENTAI_OCR_PROCESSOR_ID"),
-        help="Document AI OCR Processor ID (for secondary bbox pass)",
+        help="Document AI OCR Processor ID (for bounding box extraction)",
     )
     parser.add_argument(
         "--model",
         default=os.environ.get("GEMINI_OCR_GEMINI_MODEL_NAME"),
-        help="Gemini Model Name (e.g. gemini-2.0-flash-exp)",
+        help="Gemini Model Name (e.g. gemini-2.0-flash)",
+    )
+    parser.add_argument(
+        "--gemini-prompt",
+        default=None,
+        help="Additional instructions to append to the default Gemini prompt.",
     )
-
     parser.add_argument(
         "--output",
         type=pathlib.Path,
@@ -91,13 +96,11 @@ async def main() -> None:
         default="gemini",
         help="OCR generation mode",
     )
-
     parser.add_argument(
         "--list-models",
         action="store_true",
         help="List available Gemini models and exit",
     )
-
     parser.add_argument(
         "--no-bbox",
         action="store_true",
@@ -117,34 +120,49 @@ async def main() -> None:
         print("Error: --project or GOOGLE_CLOUD_PROJECT env var required.")
         sys.exit(1)
 
-    if not args.processor_id:
-        print("Error: --processor-id or DOCUMENTAI_LAYOUT_PARSER_PROCESSOR_ID env var required.")
-        sys.exit(1)
-
-    ocr_settings = settings.Settings(
-        project_id=args.project,
-        location=args.location,
-        quota_project_id=args.quota_project,
-        layout_processor_id=args.processor_id,
-        ocr_processor_id=args.ocr_processor_id,
-        gemini_model_name=args.model,
-        mode=args.mode,
-        include_bboxes=not args.no_bbox,
-        cache_dir=args.cache_dir,
-    )
+    cache_dir = str(args.cache_dir) if args.cache_dir else None
+
+    if args.mode == "gemini":
+        if not args.model:
+            print("Error: --model or GEMINI_OCR_GEMINI_MODEL_NAME required in gemini mode.")
+            sys.exit(1)
+        markdown_provider: anchorite.providers.MarkdownProvider = GeminiMarkdownProvider(
+            project_id=args.project,
+            location=args.location,
+            model_name=args.model,
+            quota_project_id=args.quota_project,
+            prompt=args.gemini_prompt,
+            cache_dir=cache_dir,
+        )
+    else:
+        if not args.processor_id:
+            print("Error: --processor-id required in documentai mode.")
+            sys.exit(1)
+        markdown_provider = DocAIMarkdownProvider(
+            project_id=args.project,
+            location=args.location,
+            processor_id=args.processor_id,
+            cache_dir=cache_dir,
+        )
+
+    anchor_provider: anchorite.providers.AnchorProvider | None = None
+    if not args.no_bbox and args.ocr_processor_id:
+        anchor_provider = DocAIAnchorProvider(
+            project_id=args.project,
+            location=args.location,
+            processor_id=args.ocr_processor_id,
+            cache_dir=cache_dir,
+        )
 
     print(f"Processing {args.input_pdf}...")
-    print(f"Settings: {ocr_settings}")
 
     try:
-        result = await gemini_ocr.process_document(args.input_pdf, settings=ocr_settings)
-
-        output_content = result.annotate() if ocr_settings.include_bboxes else result.markdown_content
-
-        output_path = args.output
-        output_path.write_text(output_content)
+        chunks = anchorite.document.chunks(args.input_pdf)
+        result = await anchorite.process_document(chunks, markdown_provider, anchor_provider, renumber=True)
 
-        print(f"Done! Output saved to {output_path}")
+        output_content = result.annotate() if anchor_provider else result.markdown_content
+        args.output.write_text(output_content)
+        print(f"Done! Output saved to {args.output}")
 
     except Exception as e:  # noqa: BLE001
         print(f"Error processing document: {e}")