2 changes: 1 addition & 1 deletion .github/workflows/ci.yml
@@ -23,7 +23,7 @@ jobs:
strategy:
fail-fast: false
matrix:
python-version: ["3.10", "3.11", "3.12"]
python-version: ["3.10", "3.11", "3.12", "3.13"]
steps:
- uses: actions/checkout@v4
- uses: actions/setup-python@v5
20 changes: 20 additions & 0 deletions CHANGELOG.md
@@ -5,6 +5,26 @@ All notable changes to **LongParser** are documented here.
This project follows [Semantic Versioning](https://semver.org/) and
[Keep a Changelog](https://keepachangelog.com/en/1.1.0/).

## [0.1.3] — 2026-04-13

### Fixed

- **Source code**: Added `DocumentPipeline` as a public alias for `PipelineOrchestrator` —
docs, quickstart, and all examples now use this name consistently
- **Documentation**: Fixed wrong coverage path `long_parser` → `longparser` in `CONTRIBUTING.md`
- **Documentation**: Replaced stale `cleanrag-api` reference in Docker deployment docs
- **Documentation**: Standardized Gemini API key env var to `GOOGLE_API_KEY` across all docs
- **Source code**: Updated default LLM model fallback from `gpt-4o` to `gpt-5.3` in
`schemas.py`, `llm_chain.py`, and `engine.py`
- **Source code**: Renamed stale `cleanrag:` Redis key prefix to `longparser:` in embeddings

### Changed

- Python 3.13 added to CI matrix, badges, and installation docs
- `SECURITY.md` updated with Redis rate-limiting and CORS threat mitigations

---

## [0.1.2] — 2026-04-05

### Changed
2 changes: 1 addition & 1 deletion CONTRIBUTING.md
@@ -84,7 +84,7 @@ Use Python 3.10+ type hints. All public API must be fully annotated.
uv run pytest tests/unit/ -v

# With coverage:
uv run pytest tests/unit/ --cov=src/long_parser --cov-report=term-missing
uv run pytest tests/unit/ --cov=src/longparser --cov-report=term-missing

# Full test suite (requires MongoDB + Redis):
uv run pytest tests/ -v
8 changes: 4 additions & 4 deletions README.md
@@ -18,7 +18,7 @@
<img src="https://static.pepy.tech/badge/longparser/month" alt="Monthly Downloads">
</a>
<a href="https://www.python.org/">
<img src="https://img.shields.io/badge/python-3.10%20%7C%203.11%20%7C%203.12-blue.svg" alt="Python">
<img src="https://img.shields.io/badge/python-3.10%20%7C%203.11%20%7C%203.12%20%7C%203.13-blue.svg" alt="Python">
</a>
<a href="LICENSE">
<img src="https://img.shields.io/badge/License-MIT-brightgreen.svg" alt="MIT License">
@@ -105,9 +105,9 @@ pip install "longparser[cpu]"
### Python SDK

```python
from longparser import PipelineOrchestrator, ProcessingConfig
from longparser import DocumentPipeline, ProcessingConfig

pipeline = PipelineOrchestrator()
pipeline = DocumentPipeline(ProcessingConfig())
result = pipeline.process_file("document.pdf")

print(f"Pages: {result.document.metadata.total_pages}")
@@ -186,7 +186,7 @@ src/longparser/
├── schemas.py ← core Pydantic models (Document, Block, Chunk, …)
├── extractors/ ← Docling, LaTeX OCR backends
├── chunkers/ ← HybridChunker
├── pipeline/ ← PipelineOrchestrator
├── pipeline/ ← DocumentPipeline
├── integrations/ ← LangChain loader & LlamaIndex reader
├── utils/ ← shared helpers (RTL detection, …)
└── server/ ← REST API layer
2 changes: 2 additions & 0 deletions SECURITY.md
@@ -35,6 +35,8 @@ Key risks:
| **MongoDB injection** | Motor driver + typed Pydantic inputs prevent injection |
| **SSRF via webhook** | No outbound HTTP made based on user input |
| **Hallucinated citations** | Citation IDs validated against retrieved set before returning to client |
| **DDoS / spam via API** | Route-level rate limiting, isolated per tenant via Redis |
| **Cross-origin attacks** | Configurable CORS restrictions and strict tenant isolation |

## Dependency Security

20 changes: 20 additions & 0 deletions docs/changelog.md
@@ -5,6 +5,26 @@ All notable changes to **LongParser** are documented here.
This project follows [Semantic Versioning](https://semver.org/) and
[Keep a Changelog](https://keepachangelog.com/en/1.1.0/).

## [0.1.3] — 2026-04-13

### Fixed

- **Source code**: Added `DocumentPipeline` as a public alias for `PipelineOrchestrator` —
docs, quickstart, and all examples now use this name consistently
- **Documentation**: Fixed wrong coverage path `long_parser` → `longparser` in `CONTRIBUTING.md`
- **Documentation**: Replaced stale `cleanrag-api` reference in Docker deployment docs
- **Documentation**: Standardized Gemini API key env var to `GOOGLE_API_KEY` across all docs
- **Source code**: Updated default LLM model fallback from `gpt-4o` to `gpt-5.3` in
`schemas.py`, `llm_chain.py`, and `engine.py`
- **Source code**: Renamed stale `cleanrag:` Redis key prefix to `longparser:` in embeddings

### Changed

- Python 3.13 added to CI matrix, badges, and installation docs
- `SECURITY.md` updated with Redis rate-limiting and CORS threat mitigations

---

## [0.1.2] — 2026-04-05

### Changed
2 changes: 1 addition & 1 deletion docs/contributing.md
@@ -84,7 +84,7 @@ Use Python 3.10+ type hints. All public API must be fully annotated.
uv run pytest tests/unit/ -v

# With coverage:
uv run pytest tests/unit/ --cov=src/long_parser --cov-report=term-missing
uv run pytest tests/unit/ --cov=src/longparser --cov-report=term-missing

# Full test suite (requires MongoDB + Redis):
uv run pytest tests/ -v
2 changes: 1 addition & 1 deletion docs/deployment/docker.md
@@ -49,5 +49,5 @@ docker compose up --scale longparser=3

```bash
curl http://localhost:8000/health
# {"status": "ok", "service": "cleanrag-api"}
# {"status": "ok", "service": "longparser-api"}
```
2 changes: 1 addition & 1 deletion docs/deployment/environment.md
@@ -16,7 +16,7 @@ Copy `.env.example` to `.env` and configure for your deployment.
| `LONGPARSER_LLM_PROVIDER` | `openai` | LLM provider |
| `LONGPARSER_LLM_MODEL` | _(provider default)_ | Model name |
| `OPENAI_API_KEY` | — | OpenAI API key |
| `GEMINI_API_KEY` | — | Google Gemini API key |
| `GOOGLE_API_KEY` | — | Google Gemini API key |
| `GROQ_API_KEY` | — | Groq API key |
| `OPENROUTER_API_KEY` | — | OpenRouter API key |

2 changes: 1 addition & 1 deletion docs/getting-started/configuration.md
@@ -33,7 +33,7 @@ cp .env.example .env
|---|---|
| `LONGPARSER_LLM_PROVIDER` | `openai` / `gemini` / `groq` / `openrouter` |
| `LONGPARSER_LLM_MODEL` | Model name (uses provider default if unset) |
| `GEMINI_API_KEY` | For Google Gemini |
| `GOOGLE_API_KEY` | For Google Gemini |
| `GROQ_API_KEY` | For Groq |

## Vector Store
4 changes: 2 additions & 2 deletions docs/getting-started/installation.md
@@ -2,7 +2,7 @@

## Requirements

- Python 3.10, 3.11, or 3.12
- Python 3.10, 3.11, 3.12, or 3.13
- Tesseract OCR (`brew install tesseract` / `apt install tesseract-ocr`)

---
@@ -104,5 +104,5 @@ The server starts on `http://localhost:8000`.

```python
import longparser
print(longparser.__version__) # 0.1.2
print(longparser.__version__) # 0.1.3
```
8 changes: 4 additions & 4 deletions docs/getting-started/quickstart.md
Original file line number Diff line number Diff line change
Expand Up @@ -17,11 +17,11 @@ from longparser import DocumentPipeline, ProcessingConfig
pipeline = DocumentPipeline(ProcessingConfig())

# Parse a PDF
doc = pipeline.process("research_paper.pdf")
result = pipeline.process_file("research_paper.pdf")

print(f"Pages: {len(doc.pages)}")
print(f"Blocks: {len(doc.blocks)}")
print(f"Chunks: {len(doc.chunks)}")
print(f"Pages: {result.document.metadata.total_pages}")
print(f"Chunks: {len(result.chunks)}")
print(result.chunks[0].text)
```

## 3. Inspect Chunks
2 changes: 1 addition & 1 deletion docs/guide/chat.md
@@ -70,6 +70,6 @@ Every answer's `cited_chunk_ids` are validated against the retrieved set. IDs no
| Provider | Key |
|---|---|
| OpenAI | `OPENAI_API_KEY` |
| Google Gemini | `GEMINI_API_KEY` |
| Google Gemini | `GOOGLE_API_KEY` |
| Groq | `GROQ_API_KEY` |
| OpenRouter | `OPENROUTER_API_KEY` |
8 changes: 4 additions & 4 deletions docs/guide/parsing.md
Original file line number Diff line number Diff line change
Expand Up @@ -18,7 +18,7 @@ LongParser uses **Docling** with Tesseract CLI OCR as its extraction engine —
from longparser import DocumentPipeline, ProcessingConfig

pipeline = DocumentPipeline(ProcessingConfig())
doc = pipeline.process("paper.pdf")
result = pipeline.process_file("paper.pdf")
```

## Formula Modes
@@ -36,15 +36,15 @@ config = ProcessingConfig(formula_mode="smart")

```python
# Pages
for page in doc.pages:
for page in result.document.pages:
print(f"Page {page.page_number}: {page.width}x{page.height}")

# Blocks (semantic units)
for block in doc.blocks:
for block in result.document.blocks:
print(f"[{block.type}] p={block.provenance.page_number}: {block.text[:80]}")

# Chunks (RAG-ready)
for chunk in doc.chunks:
for chunk in result.chunks:
print(f"{chunk.chunk_type} | {chunk.token_count} tokens | pages={chunk.page_numbers}")
```

7 changes: 4 additions & 3 deletions docs/index.md
@@ -16,7 +16,7 @@
<img src="https://static.pepy.tech/badge/longparser/month" alt="Monthly Downloads">
</a>&nbsp;
<a href="https://www.python.org/">
<img src="https://img.shields.io/badge/python-3.10%20%7C%203.11%20%7C%203.12-blue.svg" alt="Python">
<img src="https://img.shields.io/badge/python-3.10%20%7C%203.11%20%7C%203.12%20%7C%203.13-blue.svg" alt="Python">
</a>&nbsp;
<a href="https://github.com/ENDEVSOLS/LongParser/blob/main/LICENSE">
<img src="https://img.shields.io/badge/License-MIT-brightgreen.svg" alt="MIT License">
@@ -57,9 +57,10 @@ pip install longparser
from longparser import DocumentPipeline, ProcessingConfig

pipeline = DocumentPipeline(ProcessingConfig())
doc = pipeline.process("report.pdf")
result = pipeline.process_file("report.pdf")

print(f"Extracted {len(doc.blocks)} blocks, {len(doc.chunks)} chunks")
print(f"Chunks: {len(result.chunks)}")
print(result.chunks[0].text)
```

---
28 changes: 19 additions & 9 deletions docs/reference/pipeline.md
@@ -7,39 +7,49 @@ The `DocumentPipeline` is the main entry point for LongParser's extraction pipel
```python
from longparser import DocumentPipeline, ProcessingConfig

pipeline = DocumentPipeline(config=ProcessingConfig())
doc = pipeline.process("document.pdf")
pipeline = DocumentPipeline(ProcessingConfig())
result = pipeline.process_file("document.pdf")
```

### Constructor

```python
DocumentPipeline(config: ProcessingConfig)
DocumentPipeline(config: ProcessingConfig | None = None)
```

| Parameter | Type | Description |
|---|---|---|
| `config` | `ProcessingConfig` | Extraction and chunking configuration |
| `config` | `ProcessingConfig \| None` | Extraction and chunking configuration (uses defaults if `None`) |

### Methods

#### `process(file_path)`
#### `process_file(file_path)`

Process a document end-to-end through Extract → Validate → Chunk.

```python
doc = pipeline.process("report.pdf")
# Returns: longparser.schemas.Document
result = pipeline.process_file("report.pdf")
# Returns: longparser.pipeline.PipelineResult
```

**Returns:** `Document` with `.pages`, `.blocks`, `.chunks` populated.
**Returns:** `PipelineResult` with `.document` and `.chunks` populated.

#### `process(request)`

Process a document from a `JobRequest` object.

```python
from longparser import JobRequest
request = JobRequest(file_path="report.pdf")
result = pipeline.process(request)
```

#### `process_batch(file_paths)`

Process multiple documents sequentially.

```python
docs = pipeline.process_batch(["a.pdf", "b.docx", "c.pptx"])
results = pipeline.process_batch(["a.pdf", "b.docx", "c.pptx"])
```

## ProcessingConfig
2 changes: 1 addition & 1 deletion docs/reference/schemas.md
@@ -4,7 +4,7 @@ Core data models used throughout LongParser.

## Document

Top-level container returned by `DocumentPipeline.process()`.
Top-level container returned by `DocumentPipeline.process_file()`.

```python
class Document:
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"

[project]
name = "longparser"
version = "0.1.2"
version = "0.1.3"
description = "Privacy-first document intelligence engine — converts PDFs, DOCX, PPTX, XLSX, and CSV into AI-ready Markdown + structured JSON for RAG pipelines."
readme = {file = "README.md", content-type = "text/markdown"}
requires-python = ">=3.10"
12 changes: 8 additions & 4 deletions src/longparser/__init__.py
@@ -9,23 +9,23 @@

Quick start::

from longparser import PipelineOrchestrator, ProcessingConfig
from longparser import DocumentPipeline, ProcessingConfig

pipeline = PipelineOrchestrator()
pipeline = DocumentPipeline(ProcessingConfig())
result = pipeline.process_file("document.pdf")
print(result.chunks[0].text)

For the full REST API server::

uv run uvicorn longparser.server.app:app --reload --port 8000

See :class:`~longparser.pipeline.PipelineOrchestrator` for the main SDK entry
See :class:`~longparser.pipeline.DocumentPipeline` for the main SDK entry
point and :mod:`longparser.server` for the REST API layer.
"""

from __future__ import annotations

__version__ = "0.1.2"
__version__ = "0.1.3"
__author__ = "ENDEVSOLS Team"
__license__ = "MIT"

@@ -62,6 +62,9 @@ def __getattr__(name: str):
if name == "PipelineOrchestrator":
from .pipeline import PipelineOrchestrator
return PipelineOrchestrator
if name == "DocumentPipeline":
from .pipeline import DocumentPipeline
return DocumentPipeline
if name == "PipelineResult":
from .pipeline import PipelineResult
return PipelineResult
@@ -99,6 +102,7 @@ def __getattr__(name: str):
# Lazily imported (require extras)
"DoclingExtractor",
"PipelineOrchestrator",
"DocumentPipeline",
"PipelineResult",
"HybridChunker",
]
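
The `__getattr__` branches in this hunk are PEP 562 module-level lazy loading: the heavy `.pipeline` import only runs the first time the attribute is accessed. A standalone sketch of the mechanism, assuming nothing about LongParser internals (the module and attribute names below are illustrative):

```python
import sys
import types

# Build a throwaway module to demonstrate PEP 562 lazy attribute access.
demo = types.ModuleType("demo_pkg")

def _lazy_getattr(name: str):
    if name == "DocumentPipeline":
        # In longparser this line would be: from .pipeline import DocumentPipeline
        return "loaded lazily"
    raise AttributeError(name)

# Module-level __getattr__ is consulted only when normal lookup fails.
demo.__getattr__ = _lazy_getattr
sys.modules["demo_pkg"] = demo

import demo_pkg
print(demo_pkg.DocumentPipeline)  # triggers _lazy_getattr on first access
```

This keeps `import longparser` cheap while still exposing `DocumentPipeline` at the package top level.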
4 changes: 4 additions & 0 deletions src/longparser/pipeline/__init__.py
@@ -2,7 +2,11 @@

from .orchestrator import PipelineOrchestrator, PipelineResult

# Public alias — docs and quickstart use this name
DocumentPipeline = PipelineOrchestrator

__all__ = [
"PipelineOrchestrator",
"DocumentPipeline",
"PipelineResult",
]
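
The alias added here is plain name binding, so both names refer to the same class object and `isinstance` checks remain interchangeable. A generic sketch of the pattern (the class body is a placeholder, not the real orchestrator):

```python
class PipelineOrchestrator:
    """Stand-in for the real orchestrator class."""

# A backward-compatible public alias is just a second binding to the same object;
# no subclassing or wrapper is involved.
DocumentPipeline = PipelineOrchestrator

pipeline = DocumentPipeline()
print(DocumentPipeline is PipelineOrchestrator)
print(isinstance(pipeline, PipelineOrchestrator))
```

Because the binding is identity-preserving, existing code that type-checks against `PipelineOrchestrator` keeps working unchanged.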
4 changes: 2 additions & 2 deletions src/longparser/server/chat/engine.py
@@ -76,7 +76,7 @@
# Token Counting (model-aware) — kept as custom logic
# ---------------------------------------------------------------------------

def count_tokens(text: str, model: str = "gpt-4o") -> int:
def count_tokens(text: str, model: str = "gpt-5.3") -> int:
"""Count tokens — exact for OpenAI models, conservative approx for others."""
try:
import tiktoken
@@ -96,7 +96,7 @@ def budget_trim(
recent_turns: list[dict],
rolling_summary: str,
long_term_facts: list[dict],
model: str = "gpt-4o",
model: str = "gpt-5.3",
max_prompt_tokens: int = 6000,
) -> dict:
"""Priority-ordered truncation of prompt variables to fit token budget.
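
The `count_tokens` docstring in this hunk describes an exact tiktoken path with a conservative fallback. A hedged reconstruction of that shape; the ~4-characters-per-token heuristic and the `cl100k_base` fallback for model names unknown to the installed tiktoken are assumptions, not confirmed internals:

```python
def count_tokens(text: str, model: str = "gpt-5.3") -> int:
    """Count tokens: exact via tiktoken when available, approximate otherwise."""
    try:
        import tiktoken
        try:
            enc = tiktoken.encoding_for_model(model)
        except KeyError:
            # Model name not known to this tiktoken version: use a common encoding.
            enc = tiktoken.get_encoding("cl100k_base")
        return len(enc.encode(text))
    except ImportError:
        # tiktoken not installed: conservative ~4 chars/token approximation.
        return max(1, len(text) // 4)
```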
2 changes: 1 addition & 1 deletion src/longparser/server/chat/llm_chain.py
@@ -115,7 +115,7 @@ def get_chat_model(
"""
config = config or ChatConfig()
provider = provider or config.llm_provider
model = model or config.llm_model or DEFAULT_MODELS.get(provider, "gpt-4o")
model = model or config.llm_model or DEFAULT_MODELS.get(provider, "gpt-5.3")
max_tokens = max_tokens or config.max_output_tokens

creator = _CREATORS.get(provider)