Draft
2 changes: 1 addition & 1 deletion .github/workflows/lint.yml
@@ -42,7 +42,7 @@ jobs:
python-version: ${{ matrix.python-version }}

- name: Install package
run: pip install -e ".[logic,nlp]" scikit-learn
run: pip install -e ".[logic,nlp]" scikit-learn click

- name: Download spaCy model
run: python -m spacy download en_core_web_sm
50 changes: 41 additions & 9 deletions README.md
@@ -86,7 +86,17 @@ The typical workflow:
### Installation

```bash
pip install -e .
pip install -e . # core package
```

**Optional extras** (unlock additional features):

```bash
pip install -e ".[nlp]" # spaCy + word2number (syllogism, arithmetic solver)
python -m spacy download en_core_web_sm
pip install -e ".[logic]" # Z3 solver (formal entailment checking)
pip install -e ".[semantic]" # sentence-transformers (semantic similarity)
pip install -e ".[rest]" # requests (REST API client)
```
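
To confirm which extras are actually usable in the current environment, a small check like the following may help. It is a sketch only: the mapping from each extra to importable module names is an assumption based on the comments above (for example, that the Z3 solver imports as `z3`), and `available_extras` is a hypothetical helper, not part of the package.

```python
from importlib.util import find_spec

# Assumed import names for each extra; adjust if pyproject.toml differs.
EXTRA_MODULES = {
    "nlp": ["spacy", "word2number"],
    "logic": ["z3"],
    "semantic": ["sentence_transformers"],
    "rest": ["requests"],
}


def available_extras() -> dict[str, bool]:
    """Report, per extra, whether all of its assumed modules can be found."""
    return {
        extra: all(find_spec(name) is not None for name in modules)
        for extra, modules in EXTRA_MODULES.items()
    }


if __name__ == "__main__":
    for extra, ok in available_extras().items():
        print(f"[{extra}] {'installed' if ok else 'missing'}")
```

`find_spec` only locates a module without importing it, so the check stays fast even when heavy packages such as spaCy are installed.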

### 5-Minute Quickstart
@@ -115,13 +125,23 @@ else:

### Real-World Examples

See [`examples/`](./examples/) for production-ready code:
See [`examples/`](./examples/) for production-ready, tested code:

| Example | File | What it covers |
|---------|------|----------------|
| **Guard Verification** | [`guard_verification.py`](./examples/guard_verification.py) | ECS scoring, thresholds, repair, degradation tracking |
| **Chain-of-Thought** | [`chain_of_thought.py`](./examples/chain_of_thought.py) | Multi-step reasoning verification |
| **Arithmetic Solver** | [`arithmetic_solver.py`](./examples/arithmetic_solver.py) | Word problem solving end-to-end |
| **Syllogism Checker** | [`syllogism_verification.py`](./examples/syllogism_verification.py) | Formal logic verification (Z3 + heuristics) |
| **MCQ Picker** | [`mcq_picker.py`](./examples/mcq_picker.py) | Multiple-choice answer selection |
| **Arithmetic Repair** | [`arithmetic_repair.py`](./examples/arithmetic_repair.py) | Deterministic error correction |
| **Simple Verification** | [`simple_verification.py`](./examples/simple_verification.py) | Quick-start 3-claim demo |
| **LangChain Integration** | [`langchain_integration.py`](./examples/langchain_integration.py) | LangChain pipeline wrapper |
| **API Server** | [`api_server.py`](./examples/api_server.py) | FastAPI microservice |

- **[`simple_verification.py`](./examples/simple_verification.py)** - Basic usage (5 min)
- **[`langchain_integration.py`](./examples/langchain_integration.py)** - LangChain integration (10 min)
- **[`api_server.py`](./examples/api_server.py)** - Production FastAPI server (15 min)
All examples have tests in [`tests/test_examples.py`](./tests/test_examples.py).

Run the simple example:
Run any example:
```bash
python examples/simple_verification.py
```
@@ -163,14 +183,24 @@ Your agent (Claude Desktop, Cursor, GitHub Copilot) can then call PureReason ver
### 3. Python API (Advanced)

```python
from pureason.reasoning import verify_chain
from pureason.reasoning import verify_chain, solve_arithmetic, verify_syllogism

# Verify a chain of reasoning steps
problem = "What is 2 + 2?"
steps = ["Let me add the numbers.", "2 + 2 = 4", "Therefore, the answer is 4."]

result = verify_chain(problem, steps)
print(f"Confidence: {result.ecs}/100")
print(f"Valid: {result.is_valid}, Confidence: {result.chain_confidence:.2f}")

# Solve an arithmetic word problem
report = solve_arithmetic("Maria has 15 apples. She buys 8 more. How many in total?")
print(f"Answer: {report.answer}")

# Verify a syllogism
report = verify_syllogism(
premises=["All mammals are warm-blooded.", "Whales are mammals."],
conclusion="Whales are warm-blooded.",
)
print(f"Valid: {report.is_valid}")
```

## Core Features
@@ -235,6 +265,8 @@ cargo build --release

| Topic | Link |
|-------|------|
| **Examples** | [`examples/README.md`](./examples/README.md) - Tested use cases with code |
| **Improvement Plan** | [`docs/IMPROVEMENT-PLAN.md`](./docs/IMPROVEMENT-PLAN.md) - Roadmap for next improvements |
| **Benchmarks** | [`docs/BENCHMARK.md`](./docs/BENCHMARK.md) - Full results and methodology |
| **Reproducibility** | [`docs/REPRODUCIBILITY.md`](./docs/REPRODUCIBILITY.md) - Seeds, hashes, holdout |
| **MCP Integration** | [`docs/MCP-INTEGRATION.md`](./docs/MCP-INTEGRATION.md) - Agent setup guide |
Binary file added data/syllogism_clf.pkl
Binary file not shown.
102 changes: 102 additions & 0 deletions docs/IMPROVEMENT-PLAN.md
@@ -0,0 +1,102 @@
# PureReason Improvement Plan

> Generated from hands-on exploration and testing of v0.3.1.

## Findings Summary

### What Works Well
- **Arithmetic repair** — deterministic `A op B = C` repair is reliable and fast.
- **Chain-of-thought verification** — `verify_chain` correctly flags arithmetic errors and accumulates context.
- **MCQ picker** — tie detection with `AmbiguousAnswerError` is well-designed.
- **Guard API** — `ReasoningGuard` provides a clean, simple entry point.
- **Degradation tracking** — `_ReputationTracker` is a practical production feature.

### Issues Found

| # | Issue | Severity | Status |
|---|-------|----------|--------|
| 1 | Word-number extraction fails without `word2number` installed — tests expect it but the package is optional | Medium | Documented |
| 2 | Examples were generic and untested — no way for consumers to validate setup | High | **Fixed** |
| 3 | `verify_chain` falls back to ECS=50 when Rust binary is unavailable — no clear indication to the user | Medium | Documented |
| 4 | `_ecs_score` in `guard.py` silently returns 75.0 on any exception — masks real failures | Medium | Documented |
| 5 | Examples README referenced `verify_chain(llm_output)` with wrong signature (missing `steps` parameter) | High | **Fixed** |
| 6 | No test coverage for example use cases | High | **Fixed** |
| 7 | `solve_arithmetic` relies on spaCy NLP model but error message is unclear | Low | Documented |

---

## Improvement Plan

### Phase 1: Examples & Documentation (completed)

- [x] Create 6 focused, tested example files covering every Python use case
- [x] Add `tests/test_examples.py` with 36 tests validating all examples
- [x] Rewrite `examples/README.md` with per-use-case documentation
- [x] Update `README.md` with accurate code samples and API references
- [x] Document expected inputs, outputs, and edge cases

### Phase 2: Robustness (recommended next)

- [ ] **Graceful fallback messaging** — When the Rust binary is unavailable,
`verify_chain` falls back to ECS=50, and `_ecs_score` returns 75.0 on any
exception; neither tells the user. Each fallback should emit a clear warning.
Suggested: call `warnings.warn()` on the first fallback.

- [ ] **Optional dependency handling** — `_extract_numbers` silently skips
word-form numbers when `word2number` is not installed. Add a one-time
warning so users know they're missing functionality.

- [ ] **Consolidate install instructions** — The `pyproject.toml` optional groups
(`[nlp]`, `[logic]`, `[semantic]`, `[rest]`) should be documented in a
single "Installation" section in the README so users know what each extra
provides.
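
The first two items above could be approached along these lines. This is a minimal sketch, not the package's code: `warn_once`, `score_with_fallback`, and `extract_digit_numbers` are hypothetical names, the `scorer` argument stands in for the Rust-backed call, and only the 75.0 fallback value is taken from the findings table.

```python
import re
import warnings

try:
    from word2number import w2n  # optional dependency from the [nlp] extra
except ImportError:
    w2n = None

FALLBACK_ECS = 75.0  # fallback value the findings attribute to `_ecs_score`
_already_warned: set[str] = set()


def warn_once(key: str, message: str) -> None:
    """Emit `message` as a RuntimeWarning the first time `key` is seen."""
    if key not in _already_warned:
        _already_warned.add(key)
        warnings.warn(message, RuntimeWarning, stacklevel=3)


def score_with_fallback(text: str, scorer) -> float:
    """Call `scorer(text)`; on failure, warn once and return the fallback."""
    try:
        return float(scorer(text))
    except Exception as exc:  # Phase 4 proposes narrowing this catch
        warn_once(
            "ecs-fallback",
            f"ECS scoring failed ({exc!r}); using fallback {FALLBACK_ECS}",
        )
        return FALLBACK_ECS


def extract_digit_numbers(text: str) -> list[float]:
    """Extract digit-form numbers; warn once if word-form support is missing."""
    if w2n is None:
        warn_once(
            "word2number",
            "word2number is not installed; word-form numbers are skipped",
        )
    return [float(m) for m in re.findall(r"\d+(?:\.\d+)?", text)]
```

An explicit "already warned" set is used instead of relying on the default warning filter, because the filter deduplicates by message text and the failure message here varies with the exception.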

### Phase 3: Test Coverage

- [ ] **Integration tests with Rust binary** — Add a CI job that builds the
Rust binary and runs tests without mocking `_core._run`.

- [ ] **Benchmark regression tests** — Add a small smoke-test subset of
the HaluEval/TruthfulQA benchmarks that runs in CI to catch ECS score
regressions.

- [ ] **Property-based testing** — Use `hypothesis` for arithmetic repair
to verify `_repair_arithmetic_in_step` handles edge cases like very large
numbers, unicode operators, and chained expressions.
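
As an illustration of the property-based idea, the sketch below checks two properties of an `A op B = C` repair: the result is always correct, and repairing twice changes nothing. Two loud assumptions: `repair_arithmetic` is a simplified stand-in written for this example (it is not `_repair_arithmetic_in_step`), and the random inputs come from the standard library's `random` module rather than `hypothesis`, so the example runs without extra dependencies. With `hypothesis`, the loop would become a `@given(...)` test using `st.integers()` and `st.sampled_from("+-*")`.

```python
import random
import re

_EXPR = re.compile(r"(-?\d+)\s*([+\-*])\s*(-?\d+)\s*=\s*(-?\d+)")
_OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
}


def repair_arithmetic(step: str) -> str:
    """Rewrite each `A op B = C` in `step` so that C is the correct result."""

    def fix(match: re.Match) -> str:
        a, op, b = int(match[1]), match[2], int(match[3])
        return f"{a} {op} {b} = {_OPS[op](a, b)}"

    return _EXPR.sub(fix, step)


def check_repair_properties(trials: int = 500, seed: int = 0) -> None:
    """Randomised check: repairs are correct and idempotent."""
    rng = random.Random(seed)
    for _ in range(trials):
        a, b = rng.randrange(10**12), rng.randrange(10**12)  # large operands
        op = rng.choice("+-*")
        claimed = rng.randrange(-10**6, 10**6)  # usually a wrong answer
        step = f"So {a} {op} {b} = {claimed}."
        repaired = repair_arithmetic(step)
        assert repaired == f"So {a} {op} {b} = {_OPS[op](a, b)}."
        assert repair_arithmetic(repaired) == repaired  # idempotent
```

The stand-in handles only integer operands and ASCII operators; the unicode operators and chained expressions named above are exactly the cases a real property test would need to add generators for.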

### Phase 4: API Ergonomics

- [ ] **Typed return objects everywhere** — `pick_best_answer` returns
`tuple[int, EpistemicChainReport]` which is not self-documenting.
Consider a `MCQResult` dataclass.

- [ ] **Batch verification API** — `ReasoningGuard.verify_batch(texts)` to
verify multiple texts in a single call (parallel processing).

- [ ] **Structured error types** — Replace generic `Exception` catches with
specific error types (`BinaryNotFoundError`, `ParseError`, etc.).
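
One possible shape for these three items is sketched below. Everything here is hypothetical: the field names of `MCQResult`, the `PureReasonError` base class, and the free function `verify_batch` (which takes the single-text verifier as a callable because the guard's method name is not given in this plan) are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Any, Callable, Iterable


class PureReasonError(Exception):
    """Hypothetical common base class for library errors."""


class BinaryNotFoundError(PureReasonError):
    """Raised when the Rust binary cannot be located."""


class ParseError(PureReasonError):
    """Raised when the binary's output cannot be parsed."""


@dataclass(frozen=True)
class MCQResult:
    index: int   # position of the chosen answer
    report: Any  # would be an EpistemicChainReport in the real package

    def __iter__(self):
        # Keeps `idx, report = pick_best_answer(...)` working for old callers.
        yield self.index
        yield self.report


def verify_batch(
    verify: Callable[[str], Any], texts: Iterable[str], max_workers: int = 4
) -> list[Any]:
    """Apply `verify` to each text in parallel, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(verify, texts))
```

Making `MCQResult` iterable is one way to introduce the dataclass without breaking callers that unpack the current tuple. Threads suit this sketch if verification mostly waits on an external process; a CPU-bound pure-Python verifier would need a different executor.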

### Phase 5: Performance

- [ ] **Lazy NLP model loading** — spaCy model is loaded on first call to
`_detect_operation`. Add explicit `init()` method for applications that
want to control startup latency.

- [ ] **Caching** — `_ecs_for_text` could cache results for repeated texts
(LRU cache with configurable size).
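
Both ideas can be sketched with the standard library alone. The names `LazyResource` and `make_cached_scorer` are hypothetical, and the loader and scorer are passed in as callables because the internals of `_detect_operation` and `_ecs_for_text` are not shown here. For spaCy, the loader might be `lambda: spacy.load("en_core_web_sm")`.

```python
from functools import lru_cache
from typing import Any, Callable


class LazyResource:
    """Load an expensive resource on first use, or eagerly via `init()`."""

    def __init__(self, loader: Callable[[], Any]) -> None:
        self._loader = loader
        self._value: Any = None
        self._loaded = False

    def init(self) -> Any:
        """Load the resource now if it has not been loaded yet."""
        if not self._loaded:
            self._value = self._loader()
            self._loaded = True
        return self._value

    def get(self) -> Any:
        """Return the resource, loading it lazily on the first call."""
        return self.init()


def make_cached_scorer(scorer: Callable[[str], float], maxsize: int = 1024):
    """Wrap `scorer` in an LRU cache keyed on the input text."""
    return lru_cache(maxsize=maxsize)(scorer)
```

Applications that care about startup latency can call `init()` during boot, while others keep the current load-on-first-call behavior. The sketch is not thread-safe: two threads calling `get()` at the same time could both run the loader.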

---

## Priority Matrix

| Priority | Effort | Items |
|----------|--------|-------|
| **High** | Low | Phase 1 (done), Phase 2 fallback warnings |
| **High** | Medium | Phase 3 integration tests |
| **Medium** | Low | Phase 4 typed returns |
| **Medium** | High | Phase 5 performance |

## Recommendation

Start with **Phase 2** (robustness) — it's low-effort and directly improves the
developer experience for new consumers. Then move to **Phase 3** (test coverage)
to prevent regressions as the project grows.