Progressive Summarizer RAPTOR

Progressive Summarizer RAPTOR (Recursive API for Progressive Text Organization and Refinement) is an advanced text summarization system that creates hierarchical summaries through recursive refinement. It condenses documents while preserving essential information across multiple levels of abstraction, enabling users to navigate between different levels of detail seamlessly.

Based on the paper https://arxiv.org/abs/2401.18059 and on the code from https://github.com/run-llama/llama_index/tree/main/llama-index-packs/llama-index-packs-raptor, the system uses state-of-the-art embedding models and Large Language Models (LLMs) to capture semantic relationships within text, producing summaries that remain coherent while achieving significant compression ratios. Unlike traditional single-pass summarization tools, RAPTOR builds a progressive hierarchy in which each level is a more condensed version of the original content.

The system is designed for production environments, offering a robust REST API, Docker support, and intelligent resource management. Whether you're processing research papers, technical documentation, or business reports, RAPTOR provides a scalable solution for extracting key insights at various levels of detail.

Key Features

  • Recursive Progressive Summarization: Creates three hierarchical levels of summaries, each more condensed than the previous
  • Semantic Preservation: Uses sentence transformers to maintain semantic coherence throughout the summarization process
  • LLM Integration: Seamlessly integrates with Ollama for local LLM deployment
  • GPU Acceleration: CUDA-enabled for fast embedding generation and processing
  • Intelligent Resource Management: Optimizes CPU and memory usage based on available system resources
  • Production-Ready API: FastAPI-based REST interface with automatic documentation and validation
  • Docker Integration: Easy deployment with Docker and docker-compose for both CPU and GPU environments
  • Configurable Processing: Adjustable parameters for model selection, temperature, token limits, and processing options (summarization hierarchy is fixed at 3 levels)
  • Model Caching: Efficient model management with lifespan context managers for improved performance
  • Comprehensive Logging: Detailed logging with rotating file handlers for debugging and monitoring
  • Thread-Safe Processing: Concurrent processing capabilities with proper resource management

How the Summarization Algorithm Works

The Pipeline

RAPTOR implements a multi-stage pipeline that combines embedding-based semantic analysis with LLM-powered text generation:

  1. Input Processing: The API accepts JSON documents containing text chunks to be summarized through the /raptor/ endpoint
  2. Embedding Generation: Each text segment is converted into high-dimensional vector representations using sentence transformers (configurable via embedder_model parameter)
  3. Semantic Clustering (Level 1): The system performs dimensionality reduction and clustering to group semantically related segments
  4. Initial Summarization: First-level summaries are generated for each cluster using carefully crafted prompts sent to the LLM (configurable via llm_model and temperature parameters)
  5. Recursive Clustering (Level 2): Level 1 summaries undergo a second round of embedding and clustering to identify higher-level relationships
  6. Intermediate Summarization: Second-level summaries are generated from the Level 2 clusters
  7. Final Consolidation (Level 3): All Level 2 summaries are combined and processed to create a comprehensive final summary
  8. Token Optimization: Summaries at each level can be optimized to stay within token limits (configurable via threshold_tokens parameter)
  9. Hierarchical Output: The system returns all three levels of summaries with detailed metadata including processing time and reduction ratios

Hierarchical Clustering and Summarization Process

The core innovation of RAPTOR lies in its hierarchical clustering and multi-level summarization approach:

# Conceptual representation of the RAPTOR process
# (helper functions below are placeholders, not the actual implementation)
def raptor_process(chunks):
    # Level 1: Initial clustering and summarization
    chunks_embedded = get_embeddings(chunks)
    level1_clusters = perform_clustering(chunks_embedded)
    level1_summaries = []
    
    # Generate summaries for each Level 1 cluster
    for cluster in level1_clusters:
        cluster_text = concatenate_cluster_chunks(cluster)
        summary = generate_summary(cluster_text)
        level1_summaries.append(summary)
    
    # Level 2: Cluster the Level 1 summaries
    level1_embedded = get_embeddings(level1_summaries)
    level2_clusters = perform_clustering(level1_embedded)
    level2_summaries = []
    
    # Generate summaries for each Level 2 cluster
    for cluster in level2_clusters:
        cluster_text = concatenate_cluster_chunks(cluster)
        summary = generate_summary(cluster_text)
        level2_summaries.append(summary)
    
    # Level 3: Final consolidation
    final_text = " ".join(level2_summaries)
    final_summary = generate_summary(final_text)
    
    return {
        "level1": level1_summaries,
        "level2": level2_summaries,
        "level3": [final_summary]
    }

This approach ensures semantic coherence across multiple levels of abstraction while progressively condensing information. Rather than a simple recursive function, RAPTOR implements a sophisticated pipeline that combines semantic clustering with LLM-powered summarization at each level.

Semantic Understanding Through Embeddings

RAPTOR leverages transformer-based embedding models to capture semantic meaning:

  • Vector Representations: Text segments are converted to dense vectors that capture semantic relationships
  • Similarity Measurement: Cosine similarity between embeddings guides the summarization process
  • Concept Preservation: Key concepts are identified and preserved across summarization levels
  • Contextual Understanding: The system maintains contextual relationships between different parts of the text
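
The similarity measure that guides clustering can be illustrated with a minimal sketch. The toy vectors and the helper function below are for illustration only; the real system works with high-dimensional embeddings produced by models such as BAAI/bge-m3.

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity between two embedding vectors:
    # dot(a, b) / (|a| * |b|), in the range [-1, 1] for real embeddings.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings"; real embedders produce vectors with
# hundreds or thousands of dimensions.
print(cosine_similarity([1.0, 0.0, 1.0], [1.0, 0.0, 1.0]))  # identical -> 1.0
```

Segments whose embeddings score close to 1.0 are treated as semantically related and end up in the same cluster.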

Comparison with Traditional Summarization

| Feature | Traditional Summarization | Progressive-Summarizer-RAPTOR |
| --- | --- | --- |
| Approach | Single-pass extraction or abstraction | Multi-level recursive refinement |
| Detail Levels | One fixed level | Multiple navigable levels |
| Semantic Analysis | Limited or rule-based | Deep embedding-based understanding |
| Context Preservation | Often loses nuanced context | Maintains context through hierarchy |
| Customization | Limited parameters | Configurable models, prompts, and token limits |
| Scalability | Linear complexity | Optimized recursive processing |
| Information Access | All-or-nothing | Progressive zoom in/out capability |

Advantages of the Solution

Information Hierarchy

RAPTOR creates a natural information hierarchy that mirrors human understanding:

  • Progressive Detail: Users can start with high-level overviews and drill down as needed
  • Preserved Structure: Document structure and logical flow are maintained across levels
  • Contextual Navigation: Each level provides sufficient context to understand the next
  • Flexible Consumption: Different stakeholders can access appropriate detail levels

Superior Performance

The system is optimized for production environments:

  • GPU Acceleration: Leverages CUDA for fast embedding generation
  • Efficient Caching: Models are loaded once and reused throughout the application lifecycle
  • Concurrent Processing: Thread-safe implementation allows parallel document processing
  • Resource Optimization: Intelligent allocation of CPU cores and memory
  • Scalable Architecture: Stateless API design enables horizontal scaling

Flexibility and Customization

RAPTOR adapts to diverse use cases:

  • Model Selection: Choose from various embedding and LLM models
  • Prompt Control: Supply custom instructions to steer summary style, focus, and language (the hierarchy depth itself is fixed at three levels)
  • Parameter Tuning: Adjust temperature, context windows, and other generation parameters
  • Integration Options: REST API allows integration with any programming language or platform
  • Deployment Flexibility: Run locally, in containers, or in cloud environments

Installation and Deployment

Prerequisites

  • Docker and Docker Compose (for Docker deployment)
  • NVIDIA GPU with CUDA support (recommended for performance)
  • NVIDIA Container Toolkit (for GPU passthrough in Docker)
  • Python 3.12 (for local installation)
  • Ollama installed and running (for LLM functionality)

Getting the Code

Before proceeding with any installation method, clone the repository:

git clone https://github.com/smart-models/Progressive-Summarizer-RAPTOR.git
cd Progressive-Summarizer-RAPTOR

Local Installation with Uvicorn

  1. Create a virtual environment:

    python -m venv venv
    source venv/bin/activate  # On Linux/Mac

    For Windows users:

    • Using Command Prompt:
    venv\Scripts\activate.bat
    • Using PowerShell:
    # If you encounter execution policy restrictions, run this once per session:
    Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope Process
    
    # Then activate the virtual environment:
    venv\Scripts\Activate.ps1
  2. Install dependencies:

    pip install -r requirements.txt
  3. Run the FastAPI server:

    uvicorn raptor_api:app --reload --port 8002
  4. The API will be available at http://localhost:8002.

    Access the API documentation and interactive testing interface at http://localhost:8002/docs.

Docker Compose (Local Build)

  1. Optionally create local directories (see the note below; persistent data actually lives in Docker named volumes):

    # Linux/macOS
    mkdir -p models logs
    
    # Windows CMD
    mkdir models
    mkdir logs
    
    # Windows PowerShell
    New-Item -ItemType Directory -Path models -Force
    New-Item -ItemType Directory -Path logs -Force

    Note: Docker Compose mounts three named volumes automatically: raptor_models (downloaded embedding models), raptor_logs (application logs), and raptor_cache (Hugging Face / PyTorch caches). The models and logs directories above are for reference only; data is persisted in Docker named volumes.

  2. Deploy with Docker Compose:

    CPU-only deployment:

    cd docker
    docker compose --profile cpu up -d

    GPU-accelerated deployment (requires NVIDIA GPU):

    cd docker
    docker compose --profile gpu up -d

    Stopping the service:

    # To stop CPU deployment
    docker compose --profile cpu down
    
    # To stop GPU deployment
    docker compose --profile gpu down
    
    # To stop CPU (external Ollama) deployment
    docker compose --profile cpu-external down
    
    # To stop GPU (external Ollama) deployment
    docker compose --profile gpu-external down
  3. The API will be available at http://localhost:8002 (configurable via APP_PORT).

Pre-built Image from GitHub Container Registry

The easiest way to deploy is using our pre-built Docker images published to GitHub Container Registry.

Pull the latest image:

docker pull ghcr.io/smart-models/progressive-summarizer-raptor:latest

Run with GPU acceleration (recommended, requires NVIDIA GPU + drivers):

docker run -d \
  --name progressive-summarizer-raptor \
  --gpus all \
  -p 8002:8000 \
  -v $(pwd)/logs:/app/logs \
  ghcr.io/smart-models/progressive-summarizer-raptor:latest

Windows PowerShell:

docker run -d `
  --name progressive-summarizer-raptor `
  --gpus all `
  -p 8002:8000 `
  -v ${PWD}/logs:/app/logs `
  ghcr.io/smart-models/progressive-summarizer-raptor:latest

Run on CPU only (fallback for systems without GPU):

docker run -d \
  --name progressive-summarizer-raptor \
  -p 8002:8000 \
  -v $(pwd)/logs:/app/logs \
  ghcr.io/smart-models/progressive-summarizer-raptor:latest

Use a specific version (recommended for production):

# Replace v1.0.0 with your desired version
docker pull ghcr.io/smart-models/progressive-summarizer-raptor:v1.0.0
docker run -d --gpus all -p 8002:8000 \
  -v $(pwd)/logs:/app/logs \
  ghcr.io/smart-models/progressive-summarizer-raptor:v1.0.0

Verify the service is running:

curl http://localhost:8002/

Stop and remove the container:

docker stop progressive-summarizer-raptor
docker rm progressive-summarizer-raptor

Using an external Ollama instance

If you already have Ollama running (local network, cloud VM, managed service, etc.), use the cpu-external or gpu-external profile so the bundled Ollama containers are not started.

Set OLLAMA_BASE_URL in docker/.env:

OLLAMA_BASE_URL=http://192.168.1.10:11434

Then start RAPTOR:

cd docker

# CPU (embedder runs on CPU, Ollama is external)
docker compose --profile cpu-external up -d

# GPU (embedder uses GPU, Ollama is external)
docker compose --profile gpu-external up -d

If your external Ollama instance requires authentication, also set OLLAMA_API_KEY in docker/.env:

OLLAMA_API_KEY=your-token-here

The API key is forwarded as Authorization: Bearer <key> on every request to Ollama.

Ollama Setup

For Docker Deployment

If you're using the Docker deployment method, Ollama is automatically included in the docker-compose configuration. The docker-compose.yml file defines an ollama service that:

  • Uses the official ollama/ollama:latest image
  • Is configured to work with both CPU and GPU profiles
  • Has GPU passthrough enabled when using the GPU profile
  • Automatically connects to the RAPTOR service

No additional Ollama setup is required when using Docker deployment.

For Local Installation

If you're using the local installation method with Uvicorn, you must set up Ollama separately before running RAPTOR:

  1. Install Ollama:

    # Linux
    curl -fsSL https://ollama.ai/install.sh | sh
    
    # macOS
    brew install ollama
    
    # Windows
    # Download from https://ollama.ai/download
  2. Start Ollama service:

    ollama serve
  3. Pull the required model (default: gemma3:4b):

    ollama pull gemma3:4b

The RAPTOR API will connect to Ollama at http://localhost:11434 by default. You can change this by setting the OLLAMA_BASE_URL environment variable.

Using the API

API Endpoints

  • POST /raptor/ Processes a document and generates hierarchical summaries.

    Authentication: If API_TOKEN is configured, all POST requests must include the header:

    Authorization: Bearer <your-token>
    

    Leave API_TOKEN empty (default) to disable authentication.

    Parameters:

    • file: JSON file containing text chunks to be summarized
    • llm_model: LLM model to use for summarization (string, default: "gemma3:4b")
    • embedder_model: Model to use for generating embeddings (string, default: "BAAI/bge-m3")
    • threshold_tokens: Maximum token limit for summaries (integer, optional)
    • temperature: Controls randomness in LLM output (float, default: 0.1)
    • context_window: Maximum context window size for LLM (integer, default: 16384)
    • custom_instructions: Custom instructions for summarization (string, optional). Text chunk is added automatically.
    • chunk_metadata_json: JSON string of metadata to merge into each output chunk at the top level (string, optional)

    Expected JSON Input Format:

    {
      "chunks": [
        {
          "text": "First chunk of text content...",
          "metadata": "Optional metadata"
        },
        {
          "text": "Second chunk of text content...",
          "id": 12345
        }
      ]
    }
  • GET /
    Health check endpoint returning service status, GPU availability, API version, and Ollama connectivity status.
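
Before uploading, it can be useful to check that a document matches the documented input shape. The helper below checks only the shape shown above (a non-empty "chunks" list whose items each carry a "text" string); the API's own validation rules may be stricter.

```python
import json

def validate_raptor_input(doc: dict) -> list[str]:
    # Return a list of problems; an empty list means the document
    # matches the documented /raptor/ input shape.
    errors = []
    chunks = doc.get("chunks")
    if not isinstance(chunks, list) or not chunks:
        errors.append("'chunks' must be a non-empty list")
        return errors
    for i, chunk in enumerate(chunks):
        if not isinstance(chunk, dict) or not isinstance(chunk.get("text"), str):
            errors.append(f"chunk {i} must be an object with a 'text' string")
    return errors

doc = json.loads(
    '{"chunks": [{"text": "First chunk of text content..."},'
    ' {"text": "Second chunk of text content...", "id": 12345}]}'
)
print(validate_raptor_input(doc))  # [] -> input is valid
```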

Example API Call

Using cURL:

# Basic usage (no authentication)
curl -X POST "http://localhost:8002/raptor/" \
  -F "file=@document.json" \
  -H "accept: application/json"

# With authentication (when API_TOKEN is set)
curl -X POST "http://localhost:8002/raptor/" \
  -F "file=@document.json" \
  -H "accept: application/json" \
  -H "Authorization: Bearer your-token-here"

# With custom parameters
curl -X POST "http://localhost:8002/raptor/?llm_model=qwen2.5:7b-instruct&temperature=0.2&threshold_tokens=4000" \
  -F "file=@document.json" \
  -H "accept: application/json"

Using Python:

import requests
import json

# API endpoint
api_url = 'http://localhost:8002/raptor/'
file_path = 'document.json'

# Prepare the document
document = {
    "chunks": [
        {"text": "Your first text chunk here..."},
        {"text": "Your second text chunk here..."}
    ]
}

# Save to file
with open(file_path, 'w') as f:
    json.dump(document, f)

# Make the request
try:
    with open(file_path, 'rb') as f:
        files = {'file': (file_path, f, 'application/json')}
        params = {
            'llm_model': 'qwen2.5:7b-instruct',
            'temperature': 0.3,
            'threshold_tokens': 4000
        }
        
        response = requests.post(api_url, files=files, params=params)
        response.raise_for_status()
        
        result = response.json()
        # Group chunks by cluster level
        level_1_chunks = [chunk for chunk in result['chunks'] if chunk['cluster_level'] == 1]
        level_2_chunks = [chunk for chunk in result['chunks'] if chunk['cluster_level'] == 2]
        level_3_chunks = [chunk for chunk in result['chunks'] if chunk['cluster_level'] == 3]
        
        print(f"Generated summaries at {len(set(chunk['cluster_level'] for chunk in result['chunks']))} levels")
        print(f"Level 1 summaries: {len(level_1_chunks)}")
        print(f"Level 2 summaries: {len(level_2_chunks)}")
        print(f"Level 3 summaries: {len(level_3_chunks)}")
        if level_3_chunks:
            print(f"Sample level 3 summary: {level_3_chunks[0]['text'][:200]}...")
        else:
            print("No level 3 summary available")
        
except Exception as e:
    print(f"Error: {e}")

Response Format

A successful summarization returns a hierarchical structure:

{
  "chunks": [
    {
      "text": "First level summary with moderate compression...",
      "token_count": 2500,
      "cluster_level": 1,
      "id": 1
    },
    {
      "text": "Another first level summary...",
      "token_count": 1800,
      "cluster_level": 1,
      "id": 2
    },
    {
      "text": "Second level summary with higher compression...",
      "token_count": 1200,
      "cluster_level": 2,
      "id": 3
    },
    {
      "text": "Third level summary, highly condensed...",
      "token_count": 600,
      "cluster_level": 3,
      "id": 4
    }
  ],
  "metadata": {
    "input_chunks": 20,
    "level_1_clusters": 5,
    "level_2_clusters": 2,
    "level_3_clusters": 1,
    "total_clusters": 8,
    "reduction_ratio": 0.6,
    "llm_model": "qwen2.5:7b-instruct",
    "embedder_model": "BAAI/bge-m3",
    "temperature": 0.1,
    "context_window": 16384,
    "custom_prompt_used": false,
    "source": "document.json",
    "processing_time": {
      "total": 45.2,
      "level_1": 20.5,
      "level_2": 15.3,
      "level_3": 9.4
    }
  }
}
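
Because every summary carries a cluster_level field, client code can group the flat chunks list back into the hierarchy. The abbreviated result below mirrors the response shape documented above:

```python
from collections import defaultdict

# Abbreviated response in the documented shape
result = {
    "chunks": [
        {"text": "First level summary...", "token_count": 2500, "cluster_level": 1, "id": 1},
        {"text": "Another first level summary...", "token_count": 1800, "cluster_level": 1, "id": 2},
        {"text": "Second level summary...", "token_count": 1200, "cluster_level": 2, "id": 3},
        {"text": "Third level summary...", "token_count": 600, "cluster_level": 3, "id": 4},
    ],
    "metadata": {"reduction_ratio": 0.6},
}

# Group the flat list back into hierarchy levels
by_level = defaultdict(list)
for chunk in result["chunks"]:
    by_level[chunk["cluster_level"]].append(chunk)

# Total tokens per level shows the progressive compression
tokens_per_level = {level: sum(c["token_count"] for c in chunks)
                    for level, chunks in by_level.items()}
print(tokens_per_level)  # {1: 4300, 2: 1200, 3: 600}
```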

Configuration

RAPTOR can be tuned through environment variables (for Docker deployments) or a local .env file. The table below lists every variable consumed by the application together with its default value:

Core Application Variables

| Variable | Description | Default |
| --- | --- | --- |
| API_TOKEN | Bearer token for API authentication. Leave empty to disable. When set, all POST requests require Authorization: Bearer <token>. GET / is always public. | (disabled) |
| OLLAMA_BASE_URL | Base URL of the Ollama API server | http://localhost:11434 (Docker CPU: http://ollama-cpu:11434, GPU: http://ollama-gpu:11434) |
| OLLAMA_API_KEY | API key for authenticated external Ollama instances (cpu-external / gpu-external profiles). Sent as Authorization: Bearer <key>. Leave empty if not required. | (disabled) |
| LLM_MODEL | Default LLM model used for summarization | gemma3:4b |
| LLM_MAX_WORKERS | Max concurrent LLM requests (set <= OLLAMA_NUM_PARALLEL) | 2 |
| LLM_MAX_RETRIES | Number of retry attempts for failed LLM requests | 3 |
| LLM_BASE_DELAY | Base delay in seconds for exponential backoff between retries | 1.0 |
| LLM_TIMEOUT | Timeout in seconds for each LLM request | 600 |
| OLLAMA_NUM_THREAD | CPU threads for Ollama inference | 8 |
| OLLAMA_NUM_GPU | GPU layers for Ollama (99 = all on GPU) | 99 |
| OLLAMA_NUM_PREDICT | Max output tokens per LLM generation | 512 |
| OLLAMA_NUM_PARALLEL | Max parallel requests the bundled Ollama container can handle. Configures Ollama, not the RAPTOR app directly (set >= LLM_MAX_WORKERS) | 2 |
| EMBEDDER_MODEL | Sentence-Transformer model used for embeddings | BAAI/bge-m3 |
| TEMPERATURE | Sampling temperature for the LLM | 0.1 |
| CONTEXT_WINDOW | Maximum token window supplied to the LLM | 16384 |
| RANDOM_SEED | Seed for deterministic operations | 224 |
| MAX_WORKERS | Number of worker threads (absolute or percentage) | 75% of CPU cores |
| MODEL_CACHE_TIMEOUT | Seconds before an unused model is evicted from cache | 3600 |
| LOG_LEVEL | Logging verbosity passed to the Docker container environment. Note: the Python application sets logging to INFO unconditionally and does not read this variable at runtime. | INFO |

Docker-Specific Variables

| Variable | Description | Default |
| --- | --- | --- |
| APP_PORT | Host port mapped to the RAPTOR API container | 8002 |
| TOKENIZERS_PARALLELISM | Enable/disable tokenizers parallelism | false |
| OMP_NUM_THREADS | OpenMP thread count | 4 |
| MKL_NUM_THREADS | Intel MKL thread count | 4 |
| PYTORCH_CUDA_ALLOC_CONF | PyTorch CUDA memory allocation configuration | max_split_size_mb:512 |
| CUDA_LAUNCH_BLOCKING | CUDA kernel launch blocking mode | 0 |
| PYTHONUNBUFFERED | Python output buffering | 1 |
| OLLAMA_VERSION | Ollama image version tag for the bundled containers. Leave empty for latest; pin for reproducible deploys (e.g. 0.6.5). | (latest) |
| OLLAMA_PORT | Host port for the bundled Ollama container | 11435 |
| OLLAMA_CONTEXT_SIZE | Context size passed to the bundled Ollama container (sets the model context window at the Ollama level) | 16384 |

Note: MODEL_CACHE_TIMEOUT is read directly by the API (raptor_api.py, line 566) to control how long a model remains in the on-disk cache. The LOG_LEVEL variable is set in the Docker environment but is not consumed by the Python application, which logs at INFO unconditionally. Docker deployments use different OLLAMA_BASE_URL defaults depending on the profile (CPU/GPU).
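
As a sketch of how such settings are typically consumed, the helper below reads an integer variable with a fallback default. The names come from the table above, but the helper itself is illustrative and is not the actual code in raptor_api.py:

```python
import os

def env_int(name: str, default: int) -> int:
    # Read an integer setting from the environment; fall back to the
    # default when the variable is unset or empty.
    value = os.getenv(name)
    return int(value) if value else default

# Defaults mirror the table above
context_window = env_int("CONTEXT_WINDOW", 16384)
llm_max_workers = env_int("LLM_MAX_WORKERS", 2)
temperature = float(os.getenv("TEMPERATURE", "0.1"))
```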

Custom Instructions

RAPTOR allows you to customize the summarization behavior by providing your own instructions through the custom_instructions parameter.

How it works

You only need to provide the instructions (e.g., "Summarize in French", "Focus on technical details"). The system automatically appends the properly formatted text chunk to your instructions.

Example Custom Instructions:

You are a financial analyst. Summarize the text below focusing on revenue growth, margins, and key risks.
- Use bullet points.
- Be concise.
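
Conceptually, the wrapping works like the hypothetical sketch below: your instructions come first and the chunk is appended by the service. The exact prompt template is internal to raptor_api.py and may differ:

```python
def build_prompt(instructions: str, chunk: str) -> str:
    # Hypothetical illustration of the wrapping pattern: instructions
    # first, then the text to summarize. The real template may differ.
    return f"{instructions.strip()}\n\nText to summarize:\n{chunk.strip()}"

prompt = build_prompt(
    "Summarize in French.",
    "RAPTOR builds hierarchical summaries through recursive refinement.",
)
```

This is why you should not embed the chunk yourself: the service already places it after your instructions.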

Default Instructions

If no custom instructions are provided, RAPTOR uses the following default system instructions:

DEFAULT_INSTRUCTIONS = """You are an expert analyst. Summarize the text below in a single, high-density narrative paragraph following these rules:

1. CORE OBJECTIVE:
- Capture the primary theme and all essential facts, numbers, and entities.
- Identify and preserve all unique entities (proper nouns, specific locations) with exact naming.

2. ACCURACY & ATTRIBUTION (CRITICAL):
- STRICTLY derive all information from the provided text. Do not add external knowledge.
- VERIFY ATTRIBUTION: Ensure quotes, actions, and events are assigned to the correct entity/character. Do not shift actions between characters.
- NO HALLUCINATIONS: Do not invent causal links or details to "smooth out" the narrative. If the text is disjointed, summarize it as disjointed.
- Use only information explicitly stated in the source text.

3. STYLE & FORMATTING:
- Maintain a strict 3rd-person objective perspective.
- Write ONE cohesive paragraph. No bullet points or lists.
- Start IMMEDIATELY with the content—no introductions or meta-commentary.
- Use the EXACT SAME LANGUAGE as the original text."""

Important Note

Do NOT include {chunk} or XML tags in your custom instructions. The system handles the text insertion automatically to ensure optimal performance with Ollama.

Important Note about LLM models with thinking abilities

RAPTOR does not send a think parameter to Ollama. Models with chain-of-thought or reasoning capabilities will use their default behavior as configured in Ollama.

Contributing

Progressive-Summarizer-RAPTOR is an open-source project that welcomes contributions from the community. Your involvement helps make the tool better for everyone.

We value contributions of all kinds:

  • Bug fixes and performance improvements
  • Documentation enhancements
  • New features and capabilities
  • Test coverage improvements
  • Integration examples and tutorials

If you're interested in contributing:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Make your changes
  4. Add or update tests as appropriate
  5. Ensure all tests pass
  6. Commit your changes (git commit -m 'Add amazing feature')
  7. Push to the branch (git push origin feature/amazing-feature)
  8. Open a Pull Request

Please ensure your code follows the existing style conventions and includes appropriate documentation.

For major changes, please open an issue first to discuss what you would like to change.

Happy Summarizing with RAPTOR!

