feat(toki): persistent SQLite leaderboard + API + live demo page (v0.10.0) by konjoinfinity · Pull Request #2 · konjoai/toki

konjoinfinity · 2026-05-09T21:25:38Z

Summary

T3 — adds a persistent leaderboard that tracks model robustness scores over time, exposed via three new HTTP endpoints and a live HTML page.

- toki/leaderboard.py: pure-stdlib sqlite3 module. LeaderboardEntry dataclass + Leaderboard class with record, top_n, history, compare. Schema auto-creates on first use; values validated at the unit-interval boundary.
- API (added to demo/server.py): POST /api/leaderboard, GET /api/leaderboard/{suite}, GET /api/leaderboard/model/{name}. Suite ∈ adversarial | paraphrase | noise | all.
- demo/leaderboard.html: live page at /leaderboard.html. Suite filter tabs, 10s auto-refresh, score colour-coding (green ≥0.85 / yellow ≥0.70 / red <0.70), row-flash on new entries, offline indicator.
- demo/seed_leaderboard.json: 8 entries across phi-3-mini-4k, qwen-2.5-1.5b, llama-3.2-3b, gemma-2-2b and all three suites; auto-loaded on first request.
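For readers who want a feel for the shape of such a module, here is a minimal sketch of a sqlite3-backed leaderboard with boundary validation and a top_n that drops the suite filter for "all". It is not the code in toki/leaderboard.py: the class name SketchLeaderboard, the table layout, the column names and the ordering rule are all assumptions made for illustration, and the scores in the usage lines are made-up values, not results.

```python
import math
import sqlite3
import time

KNOWN_SUITES = ("adversarial", "paraphrase", "noise")


def _check_unit(name, value):
    # Reject NaN and anything outside the closed unit interval [0, 1].
    if math.isnan(value) or not 0.0 <= value <= 1.0:
        raise ValueError(f"{name} must be in [0, 1], got {value!r}")


class SketchLeaderboard:
    """Illustrative stand-in, not the toki.leaderboard implementation."""

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        # Schema is created on first use (hypothetical table layout).
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS entries ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, "
            "model_name TEXT NOT NULL, suite TEXT NOT NULL, "
            "pass_rate REAL NOT NULL, robustness_score REAL NOT NULL, "
            "timestamp REAL NOT NULL, notes TEXT NOT NULL DEFAULT '')"
        )

    def record(self, model_name, suite, pass_rate, robustness_score, notes=""):
        if not model_name or suite not in KNOWN_SUITES:
            raise ValueError("model_name must be non-empty and suite known")
        _check_unit("pass_rate", pass_rate)
        _check_unit("robustness_score", robustness_score)
        cur = self.conn.execute(
            "INSERT INTO entries (model_name, suite, pass_rate, "
            "robustness_score, timestamp, notes) VALUES (?, ?, ?, ?, ?, ?)",
            (model_name, suite, pass_rate, robustness_score, time.time(), notes),
        )
        self.conn.commit()
        return cur.lastrowid  # auto-assigned row id

    def top_n(self, suite, n=10):
        # "all" drops the suite filter and ranks globally.
        sql = "SELECT model_name, suite, robustness_score FROM entries"
        params = []
        if suite != "all":
            sql += " WHERE suite = ?"
            params.append(suite)
        # Assumed ordering: best score first, earliest insert breaks ties.
        sql += " ORDER BY robustness_score DESC, id ASC LIMIT ?"
        params.append(n)
        return self.conn.execute(sql, params).fetchall()


# Usage with placeholder scores (illustrative only).
board = SketchLeaderboard()
board.record("phi-3-mini-4k", "noise", 0.90, 0.88)
board.record("gemma-2-2b", "adversarial", 0.70, 0.65)
print(board.top_n("all"))
```

Passing a file path instead of ":memory:" is what would make the data survive across runs in this sketch.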

Breaking rename: Phase-7 leaderboard → ranking

The Phase-7 module was a one-shot k-model Bonferroni-corrected ranking operation, not a persistent leaderboard. To keep the namespace honest:

- toki.leaderboard (Phase 7) → toki.ranking
- Leaderboard / Config / Entry / Result → Ranking / Config / Entry / Result
- CLI: python -m toki leaderboard → python -m toki rank
- Save artefact: leaderboard.json → ranking.json
- Default --output-dir: experiments/leaderboards → experiments/rankings

No compatibility shim (pre-1.0). All Phase-7 tests carried over verbatim under the new module name and still pass.

Test plan

- python -m pytest python/tests/: 189 passed (10 new in test_leaderboard.py, 20 carried over in test_ranking.py, CLI tests renamed)
- cargo test: 7 passed
- cargo clippy -- -D warnings: clean
- End-to-end smoke test of all three new endpoints against a live demo/server.py (seed loaded, POST records, suite + model GETs return expected shape)
- Open /leaderboard.html in a browser: table renders, tabs filter, refresh ticks, error banner shows when server is down
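The end-to-end smoke test above could be scripted roughly as follows. This is a sketch only: the base URL and port are assumptions (use wherever demo/server.py is listening), and the POST body shown here simply mirrors the LeaderboardEntry field names with placeholder values; the payload shape the server actually accepts is not shown in this PR text.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: address of the running demo server


def build_smoke_requests(base_url=BASE_URL):
    # Hypothetical payload with placeholder scores, not real results.
    payload = {
        "model_name": "phi-3-mini-4k",
        "suite": "noise",
        "pass_rate": 0.90,
        "robustness_score": 0.88,
        "notes": "smoke test",
    }
    post = urllib.request.Request(
        f"{base_url}/api/leaderboard",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    get_suite = urllib.request.Request(f"{base_url}/api/leaderboard/noise")
    get_model = urllib.request.Request(
        f"{base_url}/api/leaderboard/model/phi-3-mini-4k"
    )
    return [post, get_suite, get_model]


def run_smoke(requests):
    # Sends each request in order; urlopen raises on connection or HTTP errors.
    for req in requests:
        with urllib.request.urlopen(req, timeout=5) as resp:
            print(req.get_method(), req.full_url, resp.status)
```

Calling run_smoke(build_smoke_requests()) with the server running would exercise the POST and both GET routes; building the requests separately keeps the URLs inspectable without a live server.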

🤖 Generated with Claude Code

…10.0)

T3: give toki a leaderboard that survives across runs and tracks robustness over time, surfaced as a live HTML page.

New module, toki.leaderboard (pure stdlib sqlite3):
- LeaderboardEntry(model_name, suite, pass_rate, robustness_score, timestamp, notes, id) with [0,1] range + NaN + non-empty validation at the boundary.
- Leaderboard.record / record_many / top_n / history / compare / all / count. top_n("all") drops the suite filter for global ranking. compare() picks latest-per-suite for both models and resolves a winner over the overlapping suites.
- Schema auto-creates on first connection; KNOWN_SUITES = (adversarial, paraphrase, noise) is the public contract for tabs.

Demo wiring (demo/server.py):
- POST /api/leaderboard, GET /api/leaderboard/{suite}, GET /api/leaderboard/model/{name}. Lazy-init singleton; auto-seeds from demo/seed_leaderboard.json on empty DB.
- demo/leaderboard.html: live page with suite filter tabs, 10s auto-refresh, color-coded scores (green ≥0.85, yellow ≥0.70, red <0.70), row-flash on new rows, offline indicator.
- demo/seed_leaderboard.json: 8 entries across 4 models (phi-3, qwen-2.5, llama-3.2, gemma-2) and all 3 suites.

Breaking rename, Phase-7 leaderboard → ranking:
The Phase-7 module was a one-shot k-model Bonferroni-corrected *ranking* operation, not a persistent leaderboard. Renaming it frees the leaderboard namespace for what T3 actually needs.
- toki.leaderboard (Phase 7) → toki.ranking
- Leaderboard/Config/Entry/Result → Ranking/Config/Entry/Result
- CLI: python -m toki leaderboard → python -m toki rank
- Save artefact leaderboard.json → ranking.json
- Default --output-dir experiments/leaderboards → experiments/rankings
- toki.__init__ exports updated; no compatibility shim (pre-1.0)

Tests:
- python/tests/test_leaderboard.py: 10 new tests covering schema + empty reads, record round-trip + auto-id, top_n sort/cap/filter, top_n("all") global, chronological history, compare latest-per-suite + winner, score range validation, load_seed bulk insert, cross-instance persistence + KNOWN_SUITES contract.
- test_main.py CLI tests renamed (leaderboard → rank), assert ranking.json artefact.
- test_ranking.py: Phase-7's 20 tests carried over verbatim.

Full test suite: 189 Python tests pass, 7 Rust tests pass, cargo clippy clean.

demo/leaderboard.db is gitignored.
Versions bumped: pyproject.toml + toki/__init__.py → 0.10.0.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
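The compare() behaviour named in the commit message (latest entry per suite for each model, winner decided over the suites both models share) can be illustrated with a small in-memory sketch. The tuple layout and the winner rule used here (most suites won, no winner on a tie) are assumptions; the commit message does not say how the real method resolves the winner.

```python
def compare(entries, model_a, model_b):
    """entries: list of (model_name, suite, robustness_score, timestamp)."""

    def latest_per_suite(model):
        # Keep only the most recent score per suite for this model.
        latest = {}
        for name, suite, score, ts in entries:
            if name == model and (suite not in latest or ts > latest[suite][1]):
                latest[suite] = (score, ts)
        return {suite: pair[0] for suite, pair in latest.items()}

    a = latest_per_suite(model_a)
    b = latest_per_suite(model_b)
    shared = sorted(set(a) & set(b))  # only overlapping suites count
    wins_a = sum(a[s] > b[s] for s in shared)
    wins_b = sum(b[s] > a[s] for s in shared)
    # Assumed rule: more suite wins takes it; equal counts mean no winner.
    if wins_a > wins_b:
        winner = model_a
    elif wins_b > wins_a:
        winner = model_b
    else:
        winner = None
    return {"suites": shared, "winner": winner}
```

Restricting the comparison to shared suites keeps a model from winning merely because it was evaluated on more suites than the other.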

wesleyscholl merged commit 939f04a into main May 9, 2026
2 checks passed

wesleyscholl deleted the claude/keen-ramanujan-a177e3 branch May 9, 2026 23:31

wesleyscholl merged 1 commit into main from claude/keen-ramanujan-a177e3

2 participants
