Pedromanuelsilva/CrossPath

CrossPath

CrossPath is a Tika-compatible proxy service for an internal enterprise search platform built around Apache ManifoldCF and OpenSearch.

This repository covers only the proxy layer, not the full search platform. The broader system uses ManifoldCF to crawl Windows file shares, preserve Active Directory-backed ACL visibility, extract document content, and index that content into OpenSearch for use by the search portal.

Within that larger architecture, CrossPath sits between ManifoldCF's tikaservice transformation connector and the actual extraction backends. It exposes Tika-compatible endpoints so ManifoldCF can treat it like a normal Tika service, while CrossPath applies routing and fallback rules behind the scenes.

Role In The Larger System

At a high level, the full platform looks like this:

Windows SMB shares -> ManifoldCF -> OpenSearch -> search portal

Zooming in on ManifoldCF's extraction path, requests flow as follows:

ManifoldCF tikaservice -> CrossPath -> Docling / Tika

CrossPath may also route recognized document patterns to custom extractors when generic extraction is not sufficient. These rules are intended to be defined in code, using lightweight detection logic and a routing registry rather than an external rule database.
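As an illustration of rules defined in code, the sketch below shows one way a routing registry with lightweight detectors could look. It is not taken from the CrossPath codebase: the names `Rule`, `register`, `match_rule`, the rule name `monthly-export-csv`, the extractor key `custom_csv_export`, and the `ExportVersion,` header are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# A detector looks only at the filename and the first bytes of the document,
# so it stays cheap and never performs a full parse.
Detector = Callable[[str, bytes], bool]


@dataclass(frozen=True)
class Rule:
    name: str        # hypothetical rule identifier
    detect: Detector
    extractor: str   # hypothetical key of the custom extractor to use


# In-code registry: rules live next to the code, not in an external database.
_RULES: list[Rule] = []


def register(name: str, extractor: str) -> Callable[[Detector], Detector]:
    """Decorator that appends a detector to the registry."""
    def wrap(fn: Detector) -> Detector:
        _RULES.append(Rule(name, fn, extractor))
        return fn
    return wrap


@register("monthly-export-csv", extractor="custom_csv_export")
def _is_monthly_export(filename: str, head: bytes) -> bool:
    # Assumed pattern: a CSV whose first line starts with a known header.
    return filename.lower().endswith(".csv") and head.startswith(b"ExportVersion,")


def match_rule(filename: str, head: bytes) -> Optional[Rule]:
    """Return the first matching rule, or None to use generic extraction."""
    for rule in _RULES:
        if rule.detect(filename, head):
            return rule
    return None
```

With this shape, adding a new document pattern means adding one decorated function; a `None` result from `match_rule` would leave the document on the generic Docling / Tika path.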

CrossPath is responsible for:

  • exposing Tika-compatible endpoints: PUT /meta, PUT /tika, PUT /detect/stream
  • routing documents to the best extraction backend based on file type and policy
  • using Docling for formats where richer structured extraction is preferred
  • falling back to Apache Tika for legacy or unsupported formats, or when Docling extraction fails
  • returning Tika-compatible responses back to ManifoldCF
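The Docling-first, Tika-fallback behaviour listed above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's implementation: the backends are passed in as plain callables, the contents of `DOCLING_PREFERRED` are a guess at a possible policy, and treating empty Docling output as a failure is also an assumption.

```python
from typing import Callable

# Stand-in for a call to an external extraction service (Docling Serve or Tika Server).
Extractor = Callable[[bytes], str]

# Assumed policy: content types for which Docling is tried first.
DOCLING_PREFERRED = {
    "application/pdf",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
}


def extract_with_fallback(
    data: bytes,
    content_type: str,
    docling: Extractor,
    tika: Extractor,
) -> tuple[str, str]:
    """Return (backend_name, text), preferring Docling where policy allows."""
    if content_type in DOCLING_PREFERRED:
        try:
            text = docling(data)
            # Assumption: blank output counts as a failed extraction.
            if text.strip():
                return "docling", text
        except Exception:
            # Any Docling error falls through to Tika; real code would log it.
            pass
    return "tika", tika(data)
```

Returning the backend name alongside the text makes it straightforward to record which extractor actually served a document, which supports the observability goal mentioned under Scope.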

Scope

In scope for this repository:

  • the Python proxy service
  • document-specific extraction rules and custom extractor orchestration
  • compatibility with ManifoldCF tikaservice
  • routing, buffering, fallback, normalization, and observability
  • orchestration of Docling Serve and Tika Server as external services

Out of scope for this repository:

  • crawling Windows shares
  • Active Directory authority handling
  • OpenSearch indexing logic
  • search portal / UI
  • the end-user authorization model, beyond passing through the metadata and ACL information that the surrounding platform already handles

Design Direction

CrossPath is intended to remain a small compatibility and orchestration layer rather than become a heavy general-purpose parser itself.

Current direction:

  • runtime: Python 3.13+
  • framework: FastAPI
  • deployment: containerized
  • extractors: Docling Serve and Apache Tika as external services
  • parser backends should plug into a stable adapter/registry layer so new parsers can be added without changing endpoint structure
  • backend preference and fallback should live in routing policy definitions, not in HTTP handlers
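The last two points above, a stable adapter/registry layer and routing policy kept out of HTTP handlers, could take a shape like the sketch below. Every name here (`ExtractionBackend`, `RoutePolicy`, `POLICIES`, `BackendRegistry`, `StubBackend`) is hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass
from typing import Protocol


class ExtractionBackend(Protocol):
    """Adapter interface every parser backend would implement."""
    name: str

    def extract(self, data: bytes, content_type: str) -> str: ...


@dataclass(frozen=True)
class RoutePolicy:
    preferred: tuple[str, ...]  # backend names in fallback order


# Policy is plain data, separate from handlers. Entries are assumptions.
POLICIES: dict[str, RoutePolicy] = {
    "application/pdf": RoutePolicy(("docling", "tika")),
}
DEFAULT_POLICY = RoutePolicy(("tika",))


class BackendRegistry:
    def __init__(self) -> None:
        self._backends: dict[str, ExtractionBackend] = {}

    def add(self, backend: ExtractionBackend) -> None:
        self._backends[backend.name] = backend

    def chain_for(self, content_type: str) -> list[ExtractionBackend]:
        """Resolve the ordered backend chain for a content type."""
        policy = POLICIES.get(content_type, DEFAULT_POLICY)
        return [self._backends[n] for n in policy.preferred if n in self._backends]


@dataclass
class StubBackend:
    """Placeholder adapter; a real one would call Docling Serve or Tika Server."""
    name: str

    def extract(self, data: bytes, content_type: str) -> str:
        return f"{self.name}:{len(data)} bytes"


registry = BackendRegistry()
registry.add(StubBackend("docling"))
registry.add(StubBackend("tika"))
```

An endpoint handler would then only ask the registry for `chain_for(content_type)` and walk the chain, so adding a parser means registering one more adapter and editing `POLICIES`, without touching the endpoint structure.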

Document-specific handling is a core planned capability. CrossPath should support known file patterns such as specific PDFs, spreadsheets, and structured exports, route them to custom extractors or normalization logic, and still return plain text and metadata in a Tika-compatible way for ManifoldCF.
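To illustrate the normalization step, the sketch below reduces the output of any extractor, custom or generic, to plain text plus a flat metadata dictionary. The `ExtractionResult` type, the `to_tika_shape` function, and the `X-CrossPath-Extractor` key are hypothetical; the exact metadata fields ManifoldCF expects should be checked against a real Tika Server response.

```python
from dataclasses import dataclass, field


@dataclass
class ExtractionResult:
    """Hypothetical common result type produced by every extractor."""
    text: str
    content_type: str
    extractor: str
    extra: dict[str, str] = field(default_factory=dict)


def to_tika_shape(result: ExtractionResult) -> tuple[str, dict[str, str]]:
    """Normalize a result into (plain_text, metadata) for the response layer."""
    # Strip trailing whitespace per line and end with a single newline.
    lines = (line.rstrip() for line in result.text.splitlines())
    text = "\n".join(lines).strip() + "\n"

    # Extractor-specific fields go in first so the core keys below always win.
    metadata = dict(result.extra)
    metadata["Content-Type"] = result.content_type
    # Assumed custom key recording which extractor produced the text.
    metadata["X-CrossPath-Extractor"] = result.extractor
    return text, metadata
```

Keeping this step in one place means a custom extractor only has to fill in `ExtractionResult`; the response format seen by ManifoldCF stays the same regardless of which backend did the work.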
