AmitoVrito/chunkrank

ChunkRank

Model-aware text chunking and answer re-ranking for LLM pipelines


Used internally for long-document QA and evaluation pipelines handling 1,000+ PDFs.

ChunkRank is a lightweight Python library that automatically chunks text based on an LLM's tokenizer and context window, then consolidates and ranks answers across chunks.

🔗 PyPI: https://pypi.org/project/chunkrank/


Why ChunkRank?

When working with LLMs, long documents must be split into chunks, but:

  • Every model has different tokenizers and context limits
  • Chunk sizes are usually hard-coded and error-prone
  • Answer quality drops when a response has to be assembled from multiple chunks
  • Existing RAG frameworks are heavy when you only need chunking + ranking

ChunkRank fills this gap.


Installation

pip install chunkrank

With semantic chunking + cross-encoder reranking:

pip install "chunkrank[semantic]"

With all optional backends:

pip install "chunkrank[all]"

For development:

poetry install --with dev

Quick Example

import chunkrank

with open("document.txt") as f:
    text = f.read()
question = "What is the main topic of this document?"

chunks = chunkrank.split(text, model="gpt-4o-mini")
answers = chunkrank.answer(question, chunks)
best = chunkrank.rank(answers)

print(best)

Core API

import chunkrank

# 1. Split text into model-aware chunks
chunks = chunkrank.split(text, model="gpt-4o-mini")

# 2. Answer the question across all chunks
#    Default: local extractive (no API key required)
answers = chunkrank.answer(question, chunks)

#    With OpenAI:
answers = chunkrank.answer(question, chunks, provider="openai", api_key="sk-...")

#    With Anthropic:
answers = chunkrank.answer(question, chunks, provider="anthropic", api_key="sk-ant-...")

# 3. Rank and return the best answer
best_answer = chunkrank.rank(answers)

Pipeline API

from chunkrank import ChunkRankPipeline

# Local (no LLM required)
pipe = ChunkRankPipeline(model="gpt-4o-mini")

# With OpenAI
pipe = ChunkRankPipeline(model="gpt-4o-mini", provider="openai", api_key="sk-...")

# With Anthropic
pipe = ChunkRankPipeline(model="gpt-4o-mini", provider="anthropic", api_key="sk-ant-...")

# Process — returns best answer
answer = pipe.process(question="What is the main topic?", text=text)

# Stream — yields answers progressively as each chunk is processed
for partial in pipe.stream(question="What is the main topic?", text=text):
    print(partial)
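
How stream works internally is not shown in this README. As a rough mental model only, the sketch below assumes a lazy generator that answers one chunk at a time, so a consumer can stop early without paying for the remaining chunks. The names stream_answers and first_sentence are hypothetical and are not part of ChunkRank.

```python
from typing import Callable, Iterable, Iterator


def stream_answers(
    question: str,
    chunks: Iterable[str],
    answer_fn: Callable[[str, str], str],
) -> Iterator[str]:
    """Yield one answer per chunk, computing each only when it is requested."""
    for chunk in chunks:
        yield answer_fn(question, chunk)


calls: list[str] = []


def first_sentence(question: str, chunk: str) -> str:
    # Hypothetical stand-in for a per-chunk answerer: records the call
    # and returns the text before the first period.
    calls.append(chunk)
    return chunk.split(".")[0]


stream = stream_answers("What is this about?", ["A. B.", "C. D.", "E. F."], first_sentence)
first = next(stream)  # only the first chunk has been processed at this point
```

Under this assumption, breaking out of the for loop early means later chunks are never sent to the answerer.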

Async API

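import asyncio

from chunkrank import AsyncChunkRankPipeline

async def main(question, text):
    pipe = AsyncChunkRankPipeline(model="gpt-4o-mini", provider="openai", api_key="sk-...")

    # Parallel chunk answering via asyncio.gather
    answer = await pipe.process(question, text)
    print(answer)

    # Async streaming
    async for partial in pipe.stream(question, text):
        print(partial)

# `await` is only valid inside a coroutine, so run everything through asyncio.run
asyncio.run(main("What is the main topic?", text))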

Module-level async functions:

import asyncio
import chunkrank

async def main(question, text):
    chunks = await chunkrank.async_split(text, model="gpt-4o-mini")
    answers = await chunkrank.async_answer(question, chunks)   # parallel LLM calls
    return await chunkrank.async_rank(answers)

best = asyncio.run(main(question, text))

Ranking Methods

Method           Description                              Extra dep
bm25 (default)   BM25 lexical ranking                     none
tfidf            TF-IDF cosine similarity                 none
embedding        Dense vector similarity                  [semantic] or openai-embed
cross-encoder    Semantic cross-encoder (most accurate)   [semantic]

from chunkrank import Ranker

ranker = Ranker(method="cross-encoder")
ranked = ranker.rank(question, answers)
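
To make the default method more concrete, here is a minimal pure-Python sketch of Okapi BM25 scoring. It is not ChunkRank's implementation (the library lists rank-bm25 as a dependency). The function name bm25_scores, the whitespace tokenizer, and the k1 and b defaults are illustrative assumptions.

```python
import math
from collections import Counter


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each document against the query with Okapi BM25 (illustrative only)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / max(n_docs, 1)

    # Document frequency: how many documents contain each term.
    doc_freq: Counter = Counter()
    for toks in tokenized:
        doc_freq.update(set(toks))

    scores = []
    for toks in tokenized:
        term_freq = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            # Smoothed IDF; the "1 +" keeps it non-negative for common terms.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            # Length normalisation: longer-than-average documents are penalised.
            norm = tf + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


answers = ["paris is the capital of france", "france is in europe", "berlin is in germany"]
scores = bm25_scores("capital of france", answers)
best = answers[max(range(len(answers)), key=scores.__getitem__)]
```

Because BM25 is purely lexical, an answer that paraphrases the question without sharing words scores zero, which is the usual reason to reach for the embedding or cross-encoder methods.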

Chunking Strategies

# Token-budget sliding window (default)
chunks = chunkrank.split(text, model="gpt-4o-mini", strategy="tokens", overlap_tokens=64)

# Semantic — splits on embedding similarity drops between sentences
chunks = chunkrank.split(text, model="gpt-4o-mini", strategy="semantic", similarity_threshold=0.5)
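
The token-budget strategy can be pictured as a fixed-size window that advances by chunk size minus overlap. The sketch below illustrates that idea on a plain list of tokens. It is an assumption about the general technique, not ChunkRank's code, and sliding_window is a hypothetical name.

```python
def sliding_window(tokens: list[str], chunk_size: int, overlap: int) -> list[list[str]]:
    """Split tokens into windows of chunk_size that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the text
    return chunks


tokens = [f"t{i}" for i in range(10)]
chunks = sliding_window(tokens, chunk_size=4, overlap=1)
```

Here the last token of each window reappears as the first token of the next one, which is what keeps a sentence that straddles a boundary visible in both chunks.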

Retrieve-then-Answer (top-K)

Rank chunks first, then answer only the top-K, which reduces LLM calls on large documents:

pipe = ChunkRankPipeline(model="gpt-4o-mini", retrieval_top_k=3)
answer = pipe.process(question, text)
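
As a rough illustration of the retrieve-then-answer idea, the sketch below scores chunks by word overlap with the question and keeps the best K in document order. The overlap score is a deliberately crude stand-in for whatever ranking the pipeline applies, and top_k_chunks is a hypothetical helper, not a ChunkRank function.

```python
import heapq


def top_k_chunks(question: str, chunks: list[str], k: int) -> list[str]:
    """Keep the k chunks sharing the most words with the question."""
    q_terms = set(question.lower().split())

    def overlap(indexed_chunk: tuple[int, str]) -> int:
        _, chunk = indexed_chunk
        return len(q_terms & set(chunk.lower().split()))

    best = heapq.nlargest(k, enumerate(chunks), key=overlap)
    # Sort by original index so the selected chunks keep their document order.
    return [chunk for _, chunk in sorted(best)]


chunks = [
    "the weather was mild",
    "revenue grew ten percent in 2023",
    "the board met twice",
    "revenue growth was driven by exports",
]
selected = top_k_chunks("how much did revenue grow", chunks, k=2)
```

Only the selected chunks would then be sent to the answerer, so the number of LLM calls is bounded by K rather than by document length.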

Disk Cache

Avoid re-chunking the same document on repeated runs:

from chunkrank import ChunkCache, Chunker, ChunkerConfig

cache = ChunkCache(".chunkrank_cache")
chunks = cache.get(text, model="gpt-4o-mini")
if chunks is None:
    chunks = Chunker(ChunkerConfig(model="gpt-4o-mini")).split(text)
    cache.set(text, model="gpt-4o-mini", chunks=chunks)
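
The get/set pattern above implies a cache keyed on both the text and the model. One common way to build such a key is to hash the two together, as in the sketch below. TinyChunkCache is a hypothetical stand-in written for illustration; how ChunkCache actually derives keys and stores chunks is not documented here.

```python
import hashlib
import json
import tempfile
from pathlib import Path


class TinyChunkCache:
    """Illustrative content-addressed cache; not ChunkRank's ChunkCache."""

    def __init__(self, directory: str) -> None:
        self.directory = Path(directory)
        self.directory.mkdir(parents=True, exist_ok=True)

    def _path(self, text: str, model: str) -> Path:
        # Hash model and text together: the same text chunks differently per model.
        digest = hashlib.sha256()
        digest.update(model.encode("utf-8"))
        digest.update(b"\x00")  # separator so ("ab", "c") and ("a", "bc") differ
        digest.update(text.encode("utf-8"))
        return self.directory / f"{digest.hexdigest()}.json"

    def get(self, text: str, model: str):
        path = self._path(text, model)
        if not path.exists():
            return None
        return json.loads(path.read_text(encoding="utf-8"))

    def set(self, text: str, model: str, chunks: list[str]) -> None:
        self._path(text, model).write_text(json.dumps(chunks), encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    cache = TinyChunkCache(tmp)
    miss = cache.get("some text", model="gpt-4o-mini")
    cache.set("some text", model="gpt-4o-mini", chunks=["some", "text"])
    hit = cache.get("some text", model="gpt-4o-mini")
    other_model = cache.get("some text", model="gpt-4o")
```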

Runtime Model Registration

Register new models without editing the registry JSON:

import chunkrank

chunkrank.register_model("my-custom-model", max_context=200_000)

More Examples

Read from a file

import chunkrank

with open("report.txt") as f:
    text = f.read()

chunks = chunkrank.split(text, model="gpt-4o-mini")
answers = chunkrank.answer("What are the key findings?", chunks)
print(chunkrank.rank(answers))

Batch processing multiple documents

import chunkrank

files = ["doc1.txt", "doc2.txt", "doc3.txt"]
question = "What is the main conclusion?"

for path in files:
    with open(path) as f:
        text = f.read()
    chunks = chunkrank.split(text, model="gpt-4o-mini")
    answers = chunkrank.answer(question, chunks)
    print(f"{path}: {chunkrank.rank(answers)}")

Custom chunker config

from chunkrank import Chunker, ChunkerConfig

config = ChunkerConfig(
    model="claude-sonnet-4-6",
    strategy="tokens",
    overlap_tokens=128,
    reserve_tokens=1024,
)
chunker = Chunker(config)
chunks = chunker.split(text)
print(f"{len(chunks)} chunks created")
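
The arithmetic behind these settings can be sketched as follows, assuming that reserve_tokens is subtracted from the model's context window to get the per-chunk budget and that consecutive chunks share overlap_tokens. Both assumptions, the helper names, and the 8,192-token window are illustrative; the README does not spell out the exact formula.

```python
import math


def chunk_budget(max_context: int, reserve_tokens: int) -> int:
    """Tokens left for document text after reserving room for prompt and answer."""
    budget = max_context - reserve_tokens
    if budget <= 0:
        raise ValueError("reserve_tokens must be smaller than max_context")
    return budget


def estimated_chunks(n_tokens: int, budget: int, overlap_tokens: int) -> int:
    """Number of overlapping windows needed to cover n_tokens."""
    if not 0 <= overlap_tokens < budget:
        raise ValueError("overlap_tokens must be in [0, budget)")
    if n_tokens == 0:
        return 0
    # Each window after the first contributes (budget - overlap_tokens) new tokens.
    return max(1, math.ceil((n_tokens - overlap_tokens) / (budget - overlap_tokens)))


budget = chunk_budget(max_context=8_192, reserve_tokens=1_024)  # hypothetical window
n_chunks = estimated_chunks(n_tokens=50_000, budget=budget, overlap_tokens=128)
```

A larger reserve or a larger overlap both shrink the amount of new text per chunk, so either one increases the chunk count for the same document.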

Semantic chunking with similarity threshold

# pip install chunkrank[semantic]
chunks = chunkrank.split(
    text,
    model="gpt-4o-mini",
    strategy="semantic",
    similarity_threshold=0.6,  # higher = fewer, larger chunks
)

Compare all ranking methods

from chunkrank import Ranker

question = "What is the capital of France?"
answers = ["Paris is the capital.", "France is in Europe.", "The city of Paris."]

for method in ["bm25", "tfidf", "embedding", "cross-encoder"]:
    ranker = Ranker(method=method)
    best = ranker.rank(question, answers)
    print(f"{method}: {best}")

Async batch with asyncio.gather

import asyncio
from chunkrank import AsyncChunkRankPipeline

async def process_all(docs, question):
    pipe = AsyncChunkRankPipeline(
        model="gpt-4o-mini", provider="openai", api_key="sk-..."
    )
    tasks = [pipe.process(question, doc) for doc in docs]
    return await asyncio.gather(*tasks)

results = asyncio.run(process_all(docs, "What is the summary?"))
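
An unbounded asyncio.gather over many documents can run into provider rate limits. One common remedy, sketched below, is to cap concurrency with asyncio.Semaphore. The fake_process coroutine is a hypothetical stand-in for pipe.process, and the limit of 2 is arbitrary.

```python
import asyncio


async def fake_process(question: str, doc: str) -> str:
    # Hypothetical stand-in for an LLM-backed call such as pipe.process.
    await asyncio.sleep(0)
    return f"{len(doc)} chars"


async def process_bounded(docs: list[str], question: str, limit: int = 2) -> list[str]:
    semaphore = asyncio.Semaphore(limit)

    async def worker(doc: str) -> str:
        # At most `limit` workers are inside this block at any moment.
        async with semaphore:
            return await fake_process(question, doc)

    # gather returns results in the same order as the input documents.
    return await asyncio.gather(*(worker(doc) for doc in docs))


results = asyncio.run(process_bounded(["a", "bb", "ccc"], "What is the summary?"))
```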

FastAPI streaming endpoint

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from chunkrank import ChunkRankPipeline

app = FastAPI()
pipe = ChunkRankPipeline(model="gpt-4o-mini", provider="openai", api_key="sk-...")

@app.post("/ask")
def ask(question: str, text: str):
    return StreamingResponse(
        pipe.stream(question=question, text=text),
        media_type="text/plain",
    )

Inspect a model's context info

import chunkrank

info = chunkrank.get_model_info("gpt-4o")
print(info)
# {'name': 'gpt-4o', 'max_context': 128000, 'tokenizer': 'tiktoken', ...}

Top-K retrieval + cross-encoder reranking

from chunkrank import ChunkRankPipeline

# Rank chunks first, then answer only the top 3 — fewer LLM calls on large docs
pipe = ChunkRankPipeline(
    model="claude-sonnet-4-6",
    provider="anthropic",
    api_key="sk-ant-...",
    retrieval_top_k=3,
)
answer = pipe.process("What is the conclusion?", text)

Register and use a custom model

import chunkrank

chunkrank.register_model(
    "my-llm-v2",
    max_context=512_000,
    tokenizer="tiktoken",
    tokenizer_id="o200k_base",
    default_reserve=1024,
)

chunks = chunkrank.split(text, model="my-llm-v2")
print(f"Split into {len(chunks)} chunks")

Disk cache with custom chunker

import chunkrank
from chunkrank import ChunkCache, Chunker, ChunkerConfig

cache = ChunkCache(".chunkrank_cache")
chunks = cache.get(text, model="gpt-4o")

if chunks is None:
    config = ChunkerConfig(model="gpt-4o", overlap_tokens=64)
    chunks = Chunker(config).split(text)
    cache.set(text, model="gpt-4o", chunks=chunks)

answers = chunkrank.answer(question, chunks)
print(chunkrank.rank(answers))

Supported Models

90 models in the built-in registry, including:

Provider      Models
OpenAI        gpt-4o, gpt-4o-mini, gpt-4.1, gpt-4.1-mini, gpt-4.1-nano, gpt-4-turbo, gpt-4, gpt-3.5-turbo, o1, o1-mini, o1-pro, o3, o3-mini, o4-mini
Anthropic     claude-3-opus, claude-3-sonnet, claude-3-haiku, claude-3-5-sonnet, claude-3-5-haiku, claude-haiku-4-5, claude-sonnet-4-6, claude-opus-4-6
Google        gemini-1.0-pro, gemini-1.5-pro, gemini-1.5-flash, gemini-2.0-flash, gemini-2.0-flash-lite, gemini-2.5-pro, gemini-2.5-flash
Meta          Llama-3.1-8B/70B/405B, Llama-3.2-1B/3B/11B/90B, Llama-3.3-70B, Llama-4-Scout (10M ctx), Llama-4-Maverick
Mistral       mistral-7b, mistral-small, mistral-nemo, mistral-large, mixtral-8x7b, mixtral-8x22b, codestral, pixtral-large
Microsoft     phi-3-mini, phi-3-medium, phi-4, phi-4-mini
xAI           grok-2, grok-3
Cohere        command-r, command-r-plus, command-r7b, command-a
DeepSeek      deepseek-v2, deepseek-v3, deepseek-r1, deepseek-r1-distill-qwen-32b
Qwen          qwen3-72b, qwen2.5-7b/14b/32b/72b-instruct, qwen2.5-coder-32b-instruct
IBM           granite-3.3-2b-instruct, granite-3.3-8b-instruct
EleutherAI    gpt-neo-2.7B, gpt-j-6B
Falcon        falcon-40b, falcon-180b
HuggingFace   BERT, DistilBERT, DeBERTa-v3, BigBird, Longformer, T5, FLAN-T5

Unknown models fall back to 128k context with tiktoken (o200k_base).
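
The lookup-with-fallback behaviour described above can be pictured as a dictionary lookup with a default entry. The sketch below is a toy model of that idea using only values stated in this README (128k context, tiktoken, o200k_base). The ModelInfo class and the register and lookup helpers are hypothetical and do not reflect ChunkRank's internal registry format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelInfo:
    name: str
    max_context: int
    tokenizer: str = "tiktoken"
    tokenizer_id: str = "o200k_base"


FALLBACK_CONTEXT = 128_000  # default stated in the README for unknown models

# Toy registry with a single built-in entry.
_registry: dict[str, ModelInfo] = {"gpt-4o": ModelInfo("gpt-4o", 128_000)}


def register(name: str, max_context: int, **extra: str) -> None:
    """Add or overwrite an entry at runtime."""
    _registry[name] = ModelInfo(name, max_context, **extra)


def lookup(name: str) -> ModelInfo:
    """Return the registered entry, or a fallback entry for unknown names."""
    return _registry.get(name, ModelInfo(name, FALLBACK_CONTEXT))


unknown = lookup("some-new-model")
register("my-custom-model", max_context=200_000)
custom = lookup("my-custom-model")
```

A silent fallback keeps pipelines running for new models, at the cost of possibly over- or under-sizing chunks until the model is registered with its real limits.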


How It Fits

Tool                     What it does
LangChain / LlamaIndex   Full RAG pipelines
Haystack                 End-to-end retrieval frameworks
ChunkRank                Focused, model-aware chunking + answer ranking

ChunkRank complements RAG frameworks — it doesn't replace them.


Requirements

  • Python 3.10+
  • numpy, scikit-learn, rank-bm25

License

Apache 2.0 — see LICENCE.

