Extract atomic propositions from any document. Strip filler, keep substance.
Takes a book, paper, or report and decomposes it into self-contained atomic facts -- one proposition per line. Strips introductions, transitions, anecdotes, hedging, repetition, and padding. Keeps claims, frameworks, data points, definitions, and actionable knowledge.
Based on the Dense X Retrieval methodology (Chen et al., 2023).
pip install propositionizer
# With PDF support
pip install propositionizer[pdf]
# With everything (PDF + EPUB + HTML + API server)
pip install propositionizer[all]
Or install from source:
git clone https://github.com/bogdanalexandruradu/propositionizer.git
cd propositionizer
pip install -e ".[all]"
# Single file (uses your OpenAI key)
propositionize book.txt --provider openai --api-key sk-xxx
# PDF with Gemini (free tier works)
propositionize paper.pdf --provider gemini --api-key AIza...
# Whole directory
propositionize ./research-papers/ --provider openai --api-key sk-...
# Use an environment variable instead of passing the key
export OPENAI_API_KEY=sk-xxx
propositionize book.txt
# Or point to a .env file
propositionize book.txt --env .env
# Resume interrupted job (progress saved per-chunk)
propositionize big-book.txt --resume
# Check progress
propositionize --status -o ./output
# Start the API server
propositionize --serve --port 8000
Then from any client:
# Submit a document
curl -X POST http://localhost:8000/propositionize \
-F file=@book.pdf \
-F provider=openai \
-F api_key=sk-xxx
# Check progress
curl http://localhost:8000/status/{job_id}
# Download results
curl http://localhost:8000/result/{job_id} -o propositions.jsonl
Or call it from Python:
from propositionizer import Propositionizer
p = Propositionizer(provider="gemini", api_key="AIza...")
result = p.process_file("book.pdf")
print(f"Extracted {result['total_propositions']} propositions")
print(f"Compression: {result['compression_ratio']}")
| Provider | Default Model | Env Var |
|---|---|---|
| openai | gpt-4.1-mini | OPENAI_API_KEY |
| gemini | gemini-2.5-flash | GOOGLE_API_KEY |
| anthropic | claude-sonnet-4-6 | ANTHROPIC_API_KEY |
| cerebras | qwen-3-235b | CEREBRAS_API_KEY |
| groq | llama-3.3-70b | GROQ_API_KEY |
| openrouter | gemini-2.5-flash:free | OPENROUTER_API_KEY |
| custom | (you specify) | (you specify) |
Pass --model to override the default for any provider.
Supported input formats:
- .txt, .md, .rst -- plain text
- .pdf -- requires pip install propositionizer[pdf]
- .epub -- requires pip install propositionizer[epub]
- .html, .htm -- auto-strips nav/footer/scripts
- .json, .jsonl -- extracts text fields
JSONL file with one proposition per line:
{"proposition": "The North Star Metric is a single primary metric that a company focuses on.", "source_file": "book.txt", "chunk_index": 3, "chunk_hash": "a1b2c3d4e5f6g7h8"}
{"proposition": "Companies with a defined NSM grow 2.4x faster than those without.", "source_file": "book.txt", "chunk_index": 3, "chunk_hash": "a1b2c3d4e5f6g7h8"}
- Extract text from the input file (PDF, EPUB, HTML, etc.)
- Chunk the text into ~8000 char segments at paragraph boundaries
- Propositionize each chunk via LLM -- decompose into atomic, self-contained facts
- Write results incrementally to JSONL (survives interruption)
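The chunking step above can be sketched roughly as follows. This is a minimal illustration, not the package's actual implementation: the function name `chunk_text` and the greedy packing strategy are assumptions, and only the ~8000 character target and the paragraph-boundary rule come from the description.

```python
def chunk_text(text: str, chunk_size: int = 8000) -> list[str]:
    """Greedily pack whole paragraphs into chunks of at most ~chunk_size characters.

    Hypothetical helper for illustration; propositionizer's own chunker may differ.
    """
    # Treat blank lines as paragraph boundaries and drop empty paragraphs.
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0
    for para in paragraphs:
        # +2 accounts for the blank line used to rejoin paragraphs.
        added = len(para) + (2 if current else 0)
        if current and current_len + added > chunk_size:
            # Adding this paragraph would overflow: close the current chunk.
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
            added = len(para)
        current.append(para)
        current_len += added
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

In this sketch a single paragraph longer than `chunk_size` is kept whole as one oversized chunk rather than being split mid-sentence; a real implementation might fall back to sentence boundaries instead.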
The LLM prompt instructs the model to:
- Strip all filler (intros, transitions, anecdotes, hedging, repetition)
- Keep all substance (claims, frameworks, data, definitions, methods)
- Decontextualize (replace pronouns, make each fact standalone)
- Separate claims from their evidence into distinct propositions
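One way to spot-check the decontextualization rule on the output is a simple heuristic that flags propositions still opening with a pronoun or demonstrative. The sketch below is an assumption for illustration only: `looks_context_dependent` and its word list are hypothetical and not part of propositionizer.

```python
import re

# Hypothetical heuristic: a proposition that starts with one of these words
# probably still leans on the surrounding text for its meaning.
LEADING_PRONOUN = re.compile(
    r"^(it|they|he|she|this|that|these|those|its|their)\b",
    re.IGNORECASE,
)


def looks_context_dependent(proposition: str) -> bool:
    """Return True if the proposition opens with a pronoun or demonstrative."""
    return bool(LEADING_PRONOUN.match(proposition.strip()))
```

A check like this only catches the most obvious cases (pronouns in the middle of a sentence pass through), so it is better suited to sampling output quality than to filtering.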
| Flag | Default | Description |
|---|---|---|
| --provider | openai | LLM provider |
| --api-key | (env var) | API key |
| --model | (provider default) | Override model |
| --chunk-size | 8000 | Characters per chunk |
| --workers | 4 | Concurrent API calls |
| --temperature | 0.3 | Lower = more faithful |
| --output | ./output | Output directory |
| --resume | false | Resume interrupted job |
| --serve | false | Start API server |
| --port | 8000 | API server port |
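To make the --workers and --resume flags more concrete, here is a rough sketch of how bounded concurrency and per-chunk resumption could fit together. Everything in it is an assumption for illustration: `process_chunks`, `finished_hashes`, the `extract` callback, and the truncated SHA-256 hash are hypothetical names and choices, not the package's internals. Only the JSONL field names match the documented output format.

```python
import hashlib
import json
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable


def chunk_hash(chunk: str) -> str:
    # Assumption: a truncated SHA-256 hex digest; the tool's real hash may differ.
    return hashlib.sha256(chunk.encode("utf-8")).hexdigest()[:16]


def finished_hashes(out_path: Path) -> set[str]:
    """Collect the hashes of chunks that already have records in the JSONL file."""
    if not out_path.exists():
        return set()
    with out_path.open(encoding="utf-8") as f:
        return {json.loads(line)["chunk_hash"] for line in f if line.strip()}


def process_chunks(
    chunks: list[str],
    extract: Callable[[str], list[str]],  # hypothetical stand-in for the LLM call
    out_path: Path,
    source_file: str,
    workers: int = 4,
) -> int:
    """Process chunks not yet in out_path; return how many were processed."""
    done = finished_hashes(out_path)
    pending = [(i, c) for i, c in enumerate(chunks) if chunk_hash(c) not in done]
    with ThreadPoolExecutor(max_workers=workers) as pool, \
            out_path.open("a", encoding="utf-8") as out:
        # pool.map runs up to `workers` calls at once and yields results in order.
        results = pool.map(lambda item: extract(item[1]), pending)
        for (index, chunk), props in zip(pending, results):
            for prop in props:
                record = {
                    "proposition": prop,
                    "source_file": source_file,
                    "chunk_index": index,
                    "chunk_hash": chunk_hash(chunk),
                }
                out.write(json.dumps(record) + "\n")
            # Flush after each chunk so an interruption loses at most in-flight work.
            out.flush()
    return len(pending)
```

One limitation of this sketch: a chunk that yields zero propositions leaves no record, so it would be reprocessed on resume. Tracking completed chunks in a separate progress file would avoid that.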
MIT