BogdanAlRa/propositionizer

Propositionizer

Extract atomic propositions from any document. Strip filler, keep substance.

Takes a book, paper, or report and decomposes it into self-contained atomic facts -- one proposition per line. Strips introductions, transitions, anecdotes, hedging, repetition, and padding. Keeps claims, frameworks, data points, definitions, and actionable knowledge.

Based on the Dense X Retrieval methodology (Chen et al., 2023).

Install

pip install propositionizer

# With PDF support
pip install propositionizer[pdf]

# With everything (PDF + EPUB + HTML + API server)
pip install propositionizer[all]

Or install from source:

git clone https://github.com/bogdanalexandruradu/propositionizer.git
cd propositionizer
pip install -e ".[all]"

Quick Start

CLI

# Single file (uses your OpenAI key)
propositionize book.txt --provider openai --api-key sk-xxx

# PDF with Gemini (free tier works)
propositionize paper.pdf --provider gemini --api-key AIza...

# Whole directory
propositionize ./research-papers/ --provider openai --api-key sk-...

# Use an environment variable instead of passing the key
export OPENAI_API_KEY=sk-xxx
propositionize book.txt

# Or point to a .env file
propositionize book.txt --env .env

# Resume interrupted job (progress saved per-chunk)
propositionize big-book.txt --resume

# Check progress
propositionize --status -o ./output

API Server

propositionize --serve --port 8000

Then from any client:

# Submit a document
curl -X POST http://localhost:8000/propositionize \
  -F file=@book.pdf \
  -F provider=openai \
  -F api_key=sk-xxx

# Check progress (replace {job_id} with the job's ID)
curl http://localhost:8000/status/{job_id}

# Download results
curl http://localhost:8000/result/{job_id} -o propositions.jsonl

Python Library

from propositionizer import Propositionizer

p = Propositionizer(provider="gemini", api_key="AIza...")
result = p.process_file("book.pdf")

print(f"Extracted {result['total_propositions']} propositions")
print(f"Compression: {result['compression_ratio']}")

Supported Providers (BYOK)

Provider   | Default Model         | Env Var
openai     | gpt-4.1-mini          | OPENAI_API_KEY
gemini     | gemini-2.5-flash      | GOOGLE_API_KEY
anthropic  | claude-sonnet-4-6     | ANTHROPIC_API_KEY
cerebras   | qwen-3-235b           | CEREBRAS_API_KEY
groq       | llama-3.3-70b         | GROQ_API_KEY
openrouter | gemini-2.5-flash:free | OPENROUTER_API_KEY
custom     | (you specify)         | (you specify)

Pass --model to override the default for any provider.
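
The table above implies a key lookup along these lines. This is a minimal sketch, not the package's actual code: `ENV_VARS` and `resolve_api_key` are hypothetical names, and the precedence (explicit key first, then the environment variable) is an assumption based on the CLI examples.

```python
import os

# Provider -> environment variable, copied from the table above.
ENV_VARS = {
    "openai": "OPENAI_API_KEY",
    "gemini": "GOOGLE_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "cerebras": "CEREBRAS_API_KEY",
    "groq": "GROQ_API_KEY",
    "openrouter": "OPENROUTER_API_KEY",
}


def resolve_api_key(provider, explicit_key=None, environ=None):
    """Hypothetical helper: prefer an explicit key, else read the provider's env var."""
    environ = os.environ if environ is None else environ
    if explicit_key:
        return explicit_key
    env_name = ENV_VARS.get(provider)
    if env_name is None:
        # "custom" has no default variable in the table, so a key must be passed in.
        raise ValueError(f"no default env var for provider {provider!r}")
    key = environ.get(env_name)
    if not key:
        raise ValueError(f"set {env_name} or pass a key explicitly")
    return key
```

Passing `environ` as a plain dict keeps the helper easy to test without touching the real environment.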

Supported File Formats

  • .txt, .md, .rst -- plain text
  • .pdf -- requires pip install propositionizer[pdf]
  • .epub -- requires pip install propositionizer[epub]
  • .html, .htm -- auto-strips nav/footer/scripts
  • .json, .jsonl -- extracts text fields
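
To illustrate what stripping nav/footer/scripts from HTML can look like, here is a rough sketch built on the standard library's `html.parser`. The class and function names are hypothetical, and the package may well use a different parser or a different tag list.

```python
from html.parser import HTMLParser

# Tags whose contents are dropped, per the list above.
SKIPPED_TAGS = {"nav", "footer", "script"}


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring anything nested inside a skipped tag."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside nav/footer/script
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Each non-empty text node becomes its own line in the output.
        if self._skip_depth == 0 and data.strip():
            self._parts.append(data.strip())

    def text(self):
        return "\n".join(self._parts)


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Because every text node is emitted on its own line, inline markup such as bold text splits a sentence across lines; a fuller implementation would need to handle that.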

Output Format

JSONL file with one proposition per line:

{"proposition": "The North Star Metric is a single primary metric that a company focuses on.", "source_file": "book.txt", "chunk_index": 3, "chunk_hash": "a1b2c3d4e5f6g7h8"}
{"proposition": "Companies with a defined NSM grow 2.4x faster than those without.", "source_file": "book.txt", "chunk_index": 3, "chunk_hash": "a1b2c3d4e5f6g7h8"}

How It Works

  1. Extract text from the input file (PDF, EPUB, HTML, etc.)
  2. Chunk the text into segments of roughly 8000 characters, split at paragraph boundaries
  3. Propositionize each chunk via LLM -- decompose into atomic, self-contained facts
  4. Write results incrementally to JSONL (survives interruption)
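
Step 2 can be pictured as greedy paragraph packing. The sketch below is one plausible reading of "~8000 chars at paragraph boundaries", not the package's actual chunker: it assumes paragraphs are separated by blank lines, and `chunk_text` is a hypothetical name.

```python
def chunk_text(text, chunk_size=8000):
    """Greedily pack whole paragraphs into chunks of at most chunk_size characters.

    A single paragraph longer than chunk_size is kept intact as its own
    oversized chunk rather than being split mid-paragraph.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current, current_len = [], [], 0
    for para in paragraphs:
        # Account for the "\n\n" separator when the chunk already has content.
        added = len(para) + (2 if current else 0)
        if current and current_len + added > chunk_size:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
            added = len(para)
        current.append(para)
        current_len += added
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Splitting only at paragraph boundaries keeps each chunk readable on its own, which matters when the model has to resolve pronouns within it.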

The LLM prompt instructs the model to:

  • Strip all filler (intros, transitions, anecdotes, hedging, repetition)
  • Keep all substance (claims, frameworks, data, definitions, methods)
  • Decontextualize (replace pronouns, make each fact standalone)
  • Separate claims from their evidence into distinct propositions
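
For illustration only, the four instructions above could be assembled into a prompt like the one below. The wording of `PROMPT_TEMPLATE`, the one-proposition-per-line reply format, and both helper functions are assumptions; the package's real prompt is not reproduced here.

```python
# Hypothetical prompt wording based on the four instructions listed above.
PROMPT_TEMPLATE = """\
Decompose the text below into atomic, self-contained propositions.

Rules:
- Strip all filler: introductions, transitions, anecdotes, hedging, repetition.
- Keep all substance: claims, frameworks, data, definitions, methods.
- Decontextualize: replace pronouns with their referents so each fact stands alone.
- Separate each claim from its evidence into distinct propositions.

Return one proposition per line and nothing else.

Text:
{chunk}
"""


def build_prompt(chunk):
    """Insert one chunk of source text into the template."""
    return PROMPT_TEMPLATE.format(chunk=chunk)


def parse_response(response_text):
    """Split a model reply into propositions, dropping blank lines and list markers."""
    propositions = []
    for line in response_text.splitlines():
        # Removes leading "-", "*", or bullet characters the model may add.
        line = line.strip().lstrip("-•*").strip()
        if line:
            propositions.append(line)
    return propositions
```

Stripping leading markers is a convenience that would also remove a genuine leading minus sign, so a stricter output format (for example JSON) may be preferable in practice.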

Options

Flag          | Default            | Description
--provider    | openai             | LLM provider
--api-key     | (env var)          | API key
--model       | (provider default) | Override model
--chunk-size  | 8000               | Characters per chunk
--workers     | 4                  | Concurrent API calls
--temperature | 0.3                | Lower = more faithful
--output      | ./output           | Output directory
--resume      | false              | Resume interrupted job
--serve       | false              | Start API server
--port        | 8000               | API server port
License

MIT
