sorunokoe · Copilot · Jun 2, 2026
@@ -42,7 +42,7 @@ jobs:
           python-version: ${{ matrix.python-version }}
 
       - name: Install package
-        run: pip install -e ".[logic,nlp]" scikit-learn
+        run: pip install -e ".[logic,nlp]" scikit-learn click
 
       - name: Download spaCy model
         run: python -m spacy download en_core_web_sm

@@ -86,7 +86,17 @@ The typical workflow:
 ### Installation
 
 ```bash
-pip install -e .
+pip install -e .                       # core package
+```
+
+**Optional extras** (unlock additional features):
+
+```bash
+pip install -e ".[nlp]"                # spaCy + word2number (syllogism, arithmetic solver)
+python -m spacy download en_core_web_sm
+pip install -e ".[logic]"              # Z3 solver (formal entailment checking)
+pip install -e ".[semantic]"           # sentence-transformers (semantic similarity)
+pip install -e ".[rest]"               # requests (REST API client)
 ```
 
 ### 5-Minute Quickstart
@@ -115,13 +125,23 @@ else:
 
 ### Real-World Examples
 
-See [`examples/`](./examples/) for production-ready code:
+See [`examples/`](./examples/) for production-ready, tested code:
+
+| Example | File | What it covers |
+|---------|------|----------------|
+| **Guard Verification** | [`guard_verification.py`](./examples/guard_verification.py) | ECS scoring, thresholds, repair, degradation tracking |
+| **Chain-of-Thought** | [`chain_of_thought.py`](./examples/chain_of_thought.py) | Multi-step reasoning verification |
+| **Arithmetic Solver** | [`arithmetic_solver.py`](./examples/arithmetic_solver.py) | Word problem solving end-to-end |
+| **Syllogism Checker** | [`syllogism_verification.py`](./examples/syllogism_verification.py) | Formal logic verification (Z3 + heuristics) |
+| **MCQ Picker** | [`mcq_picker.py`](./examples/mcq_picker.py) | Multiple-choice answer selection |
+| **Arithmetic Repair** | [`arithmetic_repair.py`](./examples/arithmetic_repair.py) | Deterministic error correction |
+| **Simple Verification** | [`simple_verification.py`](./examples/simple_verification.py) | Quick-start 3-claim demo |
+| **LangChain Integration** | [`langchain_integration.py`](./examples/langchain_integration.py) | LangChain pipeline wrapper |
+| **API Server** | [`api_server.py`](./examples/api_server.py) | FastAPI microservice |
 
-- **[`simple_verification.py`](./examples/simple_verification.py)** - Basic usage (5 min)
-- **[`langchain_integration.py`](./examples/langchain_integration.py)** - LangChain integration (10 min)
-- **[`api_server.py`](./examples/api_server.py)** - Production FastAPI server (15 min)
+All examples have tests in [`tests/test_examples.py`](./tests/test_examples.py).
 
-Run the simple example:
+Run any example, for instance:
 ```bash
 python examples/simple_verification.py
 ```
@@ -163,14 +183,24 @@ Your agent (Claude Desktop, Cursor, GitHub Copilot) can then call PureReason ver
 ### 3. Python API (Advanced)
 
 ```python
-from pureason.reasoning import verify_chain
+from pureason.reasoning import verify_chain, solve_arithmetic, verify_syllogism
 
 # Verify a chain of reasoning steps
 problem = "What is 2 + 2?"
 steps = ["Let me add the numbers.", "2 + 2 = 4", "Therefore, the answer is 4."]
-
 result = verify_chain(problem, steps)
-print(f"Confidence: {result.ecs}/100")
+print(f"Valid: {result.is_valid}, Confidence: {result.chain_confidence:.2f}")
+
+# Solve an arithmetic word problem
+report = solve_arithmetic("Maria has 15 apples. She buys 8 more. How many in total?")
+print(f"Answer: {report.answer}")
+
+# Verify a syllogism
+report = verify_syllogism(
+    premises=["All mammals are warm-blooded.", "Whales are mammals."],
+    conclusion="Whales are warm-blooded.",
+)
+print(f"Valid: {report.is_valid}")
 ```
 
 ## Core Features
@@ -235,6 +265,8 @@ cargo build --release
 
 | Topic | Link |
 |-------|------|
+| **Examples** | [`examples/README.md`](./examples/README.md) - Tested use cases with code |
+| **Improvement Plan** | [`docs/IMPROVEMENT-PLAN.md`](./docs/IMPROVEMENT-PLAN.md) - Roadmap for next improvements |
 | **Benchmarks** | [`docs/BENCHMARK.md`](./docs/BENCHMARK.md) - Full results and methodology |
 | **Reproducibility** | [`docs/REPRODUCIBILITY.md`](./docs/REPRODUCIBILITY.md) - Seeds, hashes, holdout |
 | **MCP Integration** | [`docs/MCP-INTEGRATION.md`](./docs/MCP-INTEGRATION.md) - Agent setup guide |

@@ -0,0 +1,102 @@
+# PureReason Improvement Plan
+
+> Generated from hands-on exploration and testing of v0.3.1.
+
+## Findings Summary
+
+### What Works Well
+- **Arithmetic repair** — deterministic `A op B = C` repair is reliable and fast.
+- **Chain-of-thought verification** — `verify_chain` correctly flags arithmetic errors and accumulates context.
+- **MCQ picker** — tie detection with `AmbiguousAnswerError` is well-designed.
+- **Guard API** — `ReasoningGuard` provides a clean, simple entry point.
+- **Degradation tracking** — `_ReputationTracker` is a practical production feature.
+
+### Issues Found
+
+| # | Issue | Severity | Status |
+|---|-------|----------|--------|
+| 1 | Word-number extraction fails without `word2number` installed — tests expect it but the package is optional | Medium | Documented |
+| 2 | Examples were generic and untested — no way for consumers to validate setup | High | **Fixed** |
+| 3 | `verify_chain` falls back to ECS=50 when Rust binary is unavailable — no clear indication to the user | Medium | Documented |
+| 4 | `_ecs_score` in `guard.py` silently returns 75.0 on any exception — masks real failures | Medium | Documented |
+| 5 | Examples README referenced `verify_chain(llm_output)` with wrong signature (missing `steps` parameter) | High | **Fixed** |
+| 6 | No test coverage for example use cases | High | **Fixed** |
+| 7 | `solve_arithmetic` relies on spaCy NLP model but error message is unclear | Low | Documented |
+
+---
+
+## Improvement Plan
+
+### Phase 1: Examples & Documentation (completed)
+
+- [x] Create 6 focused, tested example files covering every Python use case
+- [x] Add `tests/test_examples.py` with 36 tests validating all examples
+- [x] Rewrite `examples/README.md` with per-use-case documentation
+- [x] Update `README.md` with accurate code samples and API references
+- [x] Document expected inputs, outputs, and edge cases
+
+### Phase 2: Robustness (recommended next)
+
+- [ ] **Graceful fallback messaging** — When the Rust binary is unavailable,
+  `verify_chain` falls back to ECS=50, and `_ecs_score` returns 75.0 on any
+  exception.  Both paths should log a clear warning instead of falling back silently.
+  Suggested: use `warnings.warn()` on first fallback.
+
+- [ ] **Optional dependency handling** — `_extract_numbers` silently skips
+  word-form numbers when `word2number` is not installed.  Add a one-time
+  warning so users know they're missing functionality.
+
+- [ ] **Consolidate install instructions** — The `pyproject.toml` optional groups
+  (`[nlp]`, `[logic]`, `[semantic]`, `[rest]`) should be documented in a
+  single "Installation" section in the README so users know what each extra
+  provides.
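Review note on the two warning items in Phase 2: a minimal sketch of a one-time fallback warning, using only the standard-library `warnings` module. The names `FALLBACK_ECS`, `warn_once`, and `ecs_score_with_warning` are hypothetical and are not taken from the PureReason source; the real `_ecs_score` and `_extract_numbers` may be structured differently.

```python
import warnings

# Hypothetical names throughout: nothing below is taken from the PureReason source.
FALLBACK_ECS = 75.0
_warned_keys = set()


def warn_once(key, message):
    """Emit a RuntimeWarning the first time `key` is seen; return True if emitted."""
    if key in _warned_keys:
        return False
    _warned_keys.add(key)
    warnings.warn(message, RuntimeWarning, stacklevel=2)
    return True


def ecs_score_with_warning(text, scorer):
    """Call `scorer(text)`; on failure, warn once and return the fallback score."""
    try:
        return float(scorer(text))
    except Exception as exc:  # a real implementation would catch narrower errors
        warn_once(
            "ecs-fallback",
            f"ECS scoring failed ({exc!r}); returning fallback score {FALLBACK_ECS}",
        )
        return FALLBACK_ECS
```

The same `warn_once` helper could cover the missing `word2number` case with a different key, so each kind of degraded behaviour is reported once per process rather than on every call.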
+
+### Phase 3: Test Coverage
+
+- [ ] **Integration tests with Rust binary** — Add a CI job that builds the
+  Rust binary and runs tests without mocking `_core._run`.
+
+- [ ] **Benchmark regression tests** — Add a small smoke-test subset of
+  the HaluEval/TruthfulQA benchmarks that runs in CI to catch ECS score
+  regressions.
+
+- [ ] **Property-based testing** — Use `hypothesis` for arithmetic repair
+  to verify `_repair_arithmetic_in_step` handles edge cases like very large
+  numbers, unicode operators, and chained expressions.
+
+### Phase 4: API Ergonomics
+
+- [ ] **Typed return objects everywhere** — `pick_best_answer` returns
+  `tuple[int, EpistemicChainReport]` which is not self-documenting.
+  Consider an `MCQResult` dataclass.
+
+- [ ] **Batch verification API** — `ReasoningGuard.verify_batch(texts)` to
+  verify multiple texts in a single call (parallel processing).
+
+- [ ] **Structured error types** — Replace generic `Exception` catches with
+  specific error types (`BinaryNotFoundError`, `ParseError`, etc.).
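Review note on the typed-return and structured-error items in Phase 4: one possible shape, sketched under assumptions. `MCQResult`, `BinaryNotFoundError`, and `ParseError` are the names proposed in the plan; the base class `PureReasonError`, the field names, and the `__iter__` compatibility shim are hypothetical. `report` is typed as `Any` because the fields of `EpistemicChainReport` are not shown in this diff.

```python
from dataclasses import dataclass
from typing import Any


class PureReasonError(Exception):
    """Hypothetical base class so callers can catch all library errors at once."""


class BinaryNotFoundError(PureReasonError):
    """The Rust binary could not be located."""


class ParseError(PureReasonError):
    """The binary's output could not be parsed."""


@dataclass(frozen=True)
class MCQResult:
    """Named replacement for the (index, report) tuple."""

    index: int
    report: Any  # stands in for EpistemicChainReport

    def __iter__(self):
        # Keeps `idx, report = result` working for callers written against the tuple.
        yield self.index
        yield self.report
```

Making the dataclass iterable is a design choice aimed at backward compatibility: existing tuple-unpacking call sites would keep working while new code reads `result.index` and `result.report`.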
+
+### Phase 5: Performance
+
+- [ ] **Lazy NLP model loading** — spaCy model is loaded on first call to
+  `_detect_operation`, so that call pays the full load cost.  Add an explicit
+  `init()` method for applications that want to control startup latency.
+
+- [ ] **Caching** — `_ecs_for_text` could cache results for repeated texts
+  (LRU cache with configurable size).
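Review note on the caching item in Phase 5: a minimal sketch using `functools.lru_cache` with a configurable size. `make_cached_scorer` is a hypothetical helper, and it assumes the scoring function is deterministic and takes a single hashable argument (a string); whether `_ecs_for_text` meets those assumptions is not visible in this diff.

```python
from functools import lru_cache


def make_cached_scorer(scorer, maxsize=1024):
    """Wrap `scorer` (text -> score) in an LRU cache holding up to `maxsize` entries."""
    return lru_cache(maxsize=maxsize)(scorer)


def count_calls(scorer, log):
    """Test helper: record every text that actually reaches `scorer`."""
    def wrapped(text):
        log.append(text)
        return scorer(text)
    return wrapped
```

A caller would build the wrapper once, for example `cached = make_cached_scorer(score_fn, maxsize=256)`, and can inspect `cached.cache_info()` to check the hit rate before deciding on a size.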
+
+---
+
+## Priority Matrix
+
+| Priority | Effort | Items |
+|----------|--------|-------|
+| **High** | Low | Phase 1 (done), Phase 2 fallback warnings |
+| **High** | Medium | Phase 3 integration tests |
+| **Medium** | Low | Phase 4 typed returns |
+| **Medium** | High | Phase 5 performance |
+
+## Recommendation
+
+Start with **Phase 2** (robustness) — it's low-effort and directly improves the
+developer experience for new consumers.  Then move to **Phase 3** (test coverage)
+to prevent regressions as the project grows.