
feat(toki): persistent SQLite leaderboard + API + live demo page (v0.10.0)#2

Merged
wesleyscholl merged 1 commit into main from claude/keen-ramanujan-a177e3
May 9, 2026

@konjoinfinity
Contributor

Summary

T3: adds a persistent leaderboard that tracks model robustness scores over time, exposed through three new HTTP endpoints and a live HTML page.

  • toki/leaderboard.py: pure-stdlib sqlite3 module. LeaderboardEntry dataclass + Leaderboard class with record, top_n, history, compare. Schema auto-creates on first use; scores are validated against the [0, 1] range at the boundary.
  • API (added to demo/server.py): POST /api/leaderboard, GET /api/leaderboard/{suite}, GET /api/leaderboard/model/{name}. Suite ∈ adversarial | paraphrase | noise | all.
  • demo/leaderboard.html: live page at /leaderboard.html. Suite filter tabs, 10s auto-refresh, score colour-coding (green ≥0.85 / yellow ≥0.70 / red <0.70), row-flash on new entries, offline indicator.
  • demo/seed_leaderboard.json: 8 entries across phi-3-mini-4k, qwen-2.5-1.5b, llama-3.2-3b, gemma-2-2b and all three suites; auto-loaded on first request.
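
To make the storage bullet concrete, here is a minimal sketch of how such a module could look, assuming only the field and method names listed in this PR. The table name `entries`, the column layout, the default arguments, and the choice to sort by `robustness_score` are assumptions for illustration, not the code that was merged.

```python
import math
import sqlite3
import time
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class LeaderboardEntry:
    model_name: str
    suite: str
    pass_rate: float
    robustness_score: float
    timestamp: float = field(default_factory=time.time)
    notes: str = ""
    id: Optional[int] = None

    def __post_init__(self) -> None:
        # Reject empty names, NaN, and scores outside [0, 1] up front.
        if not self.model_name or not self.suite:
            raise ValueError("model_name and suite must be non-empty")
        for name in ("pass_rate", "robustness_score"):
            value = getattr(self, name)
            if math.isnan(value) or not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be in [0, 1], got {value!r}")


class Leaderboard:
    def __init__(self, path: str = ":memory:") -> None:
        self._conn = sqlite3.connect(path)
        # Create the schema on first connection if it is not there yet.
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS entries ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, model_name TEXT NOT NULL, "
            "suite TEXT NOT NULL, pass_rate REAL NOT NULL, "
            "robustness_score REAL NOT NULL, timestamp REAL NOT NULL, notes TEXT)"
        )
        self._conn.commit()

    def record(self, entry: LeaderboardEntry) -> LeaderboardEntry:
        cur = self._conn.execute(
            "INSERT INTO entries (model_name, suite, pass_rate, "
            "robustness_score, timestamp, notes) VALUES (?, ?, ?, ?, ?, ?)",
            (entry.model_name, entry.suite, entry.pass_rate,
             entry.robustness_score, entry.timestamp, entry.notes),
        )
        self._conn.commit()
        entry.id = cur.lastrowid  # auto-assigned row id
        return entry

    def top_n(self, suite: str, n: int = 10) -> List[LeaderboardEntry]:
        sql = ("SELECT model_name, suite, pass_rate, robustness_score, "
               "timestamp, notes, id FROM entries")
        params: tuple = ()
        if suite != "all":  # "all" drops the suite filter
            sql += " WHERE suite = ?"
            params = (suite,)
        sql += " ORDER BY robustness_score DESC LIMIT ?"
        rows = self._conn.execute(sql, params + (n,)).fetchall()
        return [LeaderboardEntry(*row) for row in rows]
```

Passing a file path instead of `":memory:"` is what would make the data survive across runs in this sketch.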

Breaking rename: Phase-7 leaderboard → ranking

The Phase-7 module was a one-shot k-model Bonferroni-corrected ranking operation, not a persistent leaderboard. To keep the namespace honest:

  • toki.leaderboard (Phase 7) → toki.ranking
  • Leaderboard / Config / Entry / Result → Ranking / Config / Entry / Result
  • CLI: python -m toki leaderboard → python -m toki rank
  • Save artefact: leaderboard.json → ranking.json
  • Default --output-dir: experiments/leaderboards → experiments/rankings

No compatibility shim (pre-1.0). All Phase-7 tests carried over verbatim under the new module name and still pass.

Test plan

  • python -m pytest python/tests/: 189 passed (10 new in test_leaderboard.py, 20 carried over in test_ranking.py, CLI tests renamed)
  • cargo test: 7 passed
  • cargo clippy -- -D warnings: clean
  • End-to-end smoke test of all three new endpoints against a live demo/server.py (seed loaded, POST records, suite + model GETs return expected shape)
  • Open /leaderboard.html in a browser: table renders, tabs filter, refresh ticks, error banner shows when server is down

🤖 Generated with Claude Code

…10.0)

T3: give toki a leaderboard that survives across runs and tracks
robustness over time, surfaced as a live HTML page.

New module, toki.leaderboard (pure stdlib sqlite3):
- LeaderboardEntry(model_name, suite, pass_rate, robustness_score,
  timestamp, notes, id) with [0,1] range + NaN + non-empty validation
  at the boundary.
- Leaderboard.record / record_many / top_n / history / compare / all /
  count. top_n("all") drops the suite filter for global ranking.
  compare() picks latest-per-suite for both models and resolves a
  winner over the overlapping suites.
- Schema auto-creates on first connection; KNOWN_SUITES =
  (adversarial, paraphrase, noise) is the public contract for tabs.
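
The compare() behaviour described above (latest entry per suite for each model, then a winner over the overlapping suites) can be sketched as follows. The row shape and the winner rule (higher mean score over shared suites, None on a tie or when no suite overlaps) are assumptions; the commit message does not say how the winner is resolved.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical row shape: (model_name, suite, robustness_score, timestamp)
Row = Tuple[str, str, float, float]


def latest_per_suite(rows: List[Row], model: str) -> Dict[str, float]:
    """Map each suite to the given model's most recent robustness score."""
    latest: Dict[str, Tuple[float, float]] = {}
    for name, suite, score, ts in rows:
        if name != model:
            continue
        if suite not in latest or ts > latest[suite][0]:
            latest[suite] = (ts, score)
    return {suite: score for suite, (_, score) in latest.items()}


def compare(rows: List[Row], model_a: str, model_b: str) -> Optional[str]:
    """Return the winning model name, or None on a tie or no shared suite.

    Assumed rule: higher mean score over the suites both models have run.
    """
    a = latest_per_suite(rows, model_a)
    b = latest_per_suite(rows, model_b)
    shared = sorted(set(a) & set(b))
    if not shared:
        return None
    mean_a = sum(a[s] for s in shared) / len(shared)
    mean_b = sum(b[s] for s in shared) / len(shared)
    if mean_a == mean_b:
        return None
    return model_a if mean_a > mean_b else model_b
```

Restricting the comparison to shared suites keeps a model from winning only because it was evaluated on more suites.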

Demo wiring (demo/server.py):
- POST /api/leaderboard, GET /api/leaderboard/{suite}, GET
  /api/leaderboard/model/{name}. Lazy-init singleton; auto-seeds
  from demo/seed_leaderboard.json on empty DB.
- demo/leaderboard.html: live page with suite filter tabs, 10s
  auto-refresh, color-coded scores (green ≥0.85, yellow ≥0.70,
  red <0.70), row-flash on new rows, offline indicator.
- demo/seed_leaderboard.json: 8 entries across 4 models
  (phi-3, qwen-2.5, llama-3.2, gemma-2) and all 3 suites.
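
A client for the three endpoints might build its requests roughly like this. The base URL is a hypothetical placeholder, and the JSON body shape (keys mirroring the LeaderboardEntry fields) is an assumption; the PR does not document the request schema.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "http://localhost:8000"  # hypothetical host and port


def build_post(entry: dict) -> urllib.request.Request:
    """Build (but do not send) a POST /api/leaderboard request."""
    body = json.dumps(entry).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/api/leaderboard",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def suite_url(suite: str) -> str:
    """URL for GET /api/leaderboard/{suite}; suite may also be "all"."""
    return f"{BASE_URL}/api/leaderboard/{urllib.parse.quote(suite, safe='')}"


def model_url(name: str) -> str:
    """URL for GET /api/leaderboard/model/{name}."""
    return f"{BASE_URL}/api/leaderboard/model/{urllib.parse.quote(name, safe='')}"
```

With the demo server running, `urllib.request.urlopen(build_post(entry))` would send the record, and `urllib.request.urlopen(suite_url("noise"))` would fetch one suite's table.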

Breaking rename, Phase-7 leaderboard → ranking:
The Phase-7 module was a one-shot k-model Bonferroni-corrected
*ranking* operation, not a persistent leaderboard. Renaming it
frees the leaderboard namespace for what T3 actually needs.
- toki.leaderboard (Phase 7) → toki.ranking
- Leaderboard/Config/Entry/Result → Ranking/Config/Entry/Result
- CLI: python -m toki leaderboard → python -m toki rank
- Save artefact leaderboard.json → ranking.json
- Default --output-dir experiments/leaderboards →
  experiments/rankings
- toki.__init__ exports updated; no compatibility shim (pre-1.0)

Tests:
- python/tests/test_leaderboard.py: 10 new tests covering schema
  + empty reads, record round-trip + auto-id, top_n sort/cap/filter,
  top_n("all") global, chronological history, compare latest-per-
  suite + winner, score range validation, load_seed bulk insert,
  cross-instance persistence + KNOWN_SUITES contract.
- test_main.py CLI tests renamed (leaderboard → rank), assert
  ranking.json artefact.
- test_ranking.py: Phase-7's 20 tests carried over verbatim.
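
As an illustration of the cross-instance persistence check listed above, the following sketch writes through one connection, closes it, and reads the row back through a fresh one. It uses raw sqlite3 with a hypothetical, simplified schema rather than the real toki.leaderboard API.

```python
import os
import sqlite3
import tempfile


def open_db(path: str) -> sqlite3.Connection:
    """Open the database and create the (simplified) schema if missing."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS entries ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, model_name TEXT, "
        "suite TEXT, robustness_score REAL)"
    )
    conn.commit()
    return conn


def check_cross_instance_persistence() -> int:
    """Insert via one connection, then count rows via a second one."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "leaderboard.db")

        first = open_db(path)
        first.execute(
            "INSERT INTO entries (model_name, suite, robustness_score) "
            "VALUES (?, ?, ?)",
            ("demo-model", "noise", 0.9),
        )
        first.commit()
        first.close()

        # A new connection must see the row written by the first one.
        second = open_db(path)
        count = second.execute("SELECT COUNT(*) FROM entries").fetchone()[0]
        second.close()
        return count
```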

Full test suite: 189 Python tests pass, 7 Rust tests pass, cargo
clippy clean. demo/leaderboard.db is gitignored.

Versions bumped: pyproject.toml + toki/__init__.py → 0.10.0.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@wesleyscholl wesleyscholl merged commit 939f04a into main May 9, 2026
2 checks passed
@wesleyscholl wesleyscholl deleted the claude/keen-ramanujan-a177e3 branch May 9, 2026 23:31