feat(skillopt): implement SkillOpt skill-document optimizer #75

Draft
wesleysimplicio wants to merge 1 commit into main from claude/skillopt-implementation-SHUSn

@wesleysimplicio
Owner

Summary

Implements SkillOpt (Microsoft Research — Executive Strategy for Self-Evolving Agent Skills) as a tool for this repo's .skills/ ecosystem. SkillOpt treats a natural-language skill document as the only trainable artifact and optimizes it for a frozen target model through a four-stage loop:

  • Rollout — score the current skill on the train split.
  • Reflect — turn failure/success batches into candidate edits.
  • Edit — apply up to budget add/delete/replace ops (the textual learning rate), skipping any op already in the rejected-edit buffer.
  • Gate — accept a candidate only if it improves a held-out split; otherwise its edits are buffered as negative feedback.
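For reviewers who want the loop in one screen, here is a minimal sketch of the four stages. The function names mirror those listed for engine.js, but the bodies, signatures, the task shape (`{ directives: [...] }`), and the toy scorer are all assumptions made for illustration, not the code in this PR. Only `add` edits are implemented.

```javascript
// Hypothetical sketch of Rollout -> Reflect -> Edit -> Gate. Not the PR's engine.js.

// Rollout: mean score of a skill over a list of tasks (0 for an empty list).
function rollout(skill, tasks, scorer) {
  if (tasks.length === 0) return 0;
  const total = tasks.reduce((sum, task) => sum + scorer(skill, task), 0);
  return total / tasks.length;
}

// Reflect: propose one "add" edit per directive that a failing task expects
// but the skill text does not contain yet.
function reflect(skill, tasks, scorer) {
  const edits = [];
  for (const task of tasks) {
    if (scorer(skill, task) >= 1) continue; // task already passes
    for (const directive of task.directives) {
      const known = edits.some((e) => e.text === directive);
      if (!skill.includes(directive) && !known) edits.push({ op: 'add', text: directive });
    }
  }
  return edits;
}

// Edit: apply at most `budget` edits, skipping anything in the rejected buffer.
function applyEdits(skill, edits, budget, rejected) {
  const applied = [];
  let next = skill;
  for (const edit of edits) {
    if (applied.length >= budget) break;
    if (edit.op !== 'add') continue; // delete/replace omitted in this sketch
    if (rejected.has(JSON.stringify(edit))) continue;
    next = `${next}\n- ${edit.text}`;
    applied.push(edit);
  }
  return { skill: next, applied };
}

// Gate: keep a candidate only on a strict improvement over the best gate score.
function optimize(skill, { train, holdout, scorer, rounds = 3, budget = 2 }) {
  const usedHoldout = holdout.length > 0;
  const gateTasks = usedHoldout ? holdout : train; // fall back to train split
  const rejected = new Set();
  let best = skill;
  let bestGate = rollout(best, gateTasks, scorer);
  for (let round = 0; round < rounds; round += 1) {
    const edits = reflect(best, train, scorer);
    const { skill: candidate, applied } = applyEdits(best, edits, budget, rejected);
    if (applied.length === 0) break; // nothing new to try
    const candidateGate = rollout(candidate, gateTasks, scorer);
    if (candidateGate > bestGate) {
      best = candidate;
      bestGate = candidateGate;
    } else {
      for (const edit of applied) rejected.add(JSON.stringify(edit));
    }
  }
  return { best, bestGate, usedHoldout };
}

// Toy scorer: fraction of a task's directives found verbatim in the skill.
const toyScorer = (skill, task) =>
  task.directives.filter((d) => skill.includes(d)).length / task.directives.length;

const result = optimize('# Skill', {
  train: [{ directives: ['cite sources'] }],
  holdout: [{ directives: ['cite sources'] }],
  scorer: toyScorer,
});
console.log(result.bestGate, result.usedHoldout);
```

Because the comparison is strict, a candidate that ties or regresses leaves `best` untouched and its edits land in the rejected set, so the next round does not retry them.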

Output is best_skill.md, plus an optional run report and a content-addressed receipt under .catalog/receipts/ (matching the repo's receipt schema).
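As a rough illustration of what "content-addressed" means here, the sketch below names a receipt file after the SHA-256 digest of its own bytes. The helper names (`contentAddress`, `writeReceipt`), the `.json` extension, and the receipt fields are hypothetical; the repo's actual receipt schema is not reproduced.

```javascript
const crypto = require('node:crypto');
const fs = require('node:fs');
const path = require('node:path');

// SHA-256 hex digest of the exact text that will be written to disk.
function contentAddress(text) {
  return crypto.createHash('sha256').update(text, 'utf8').digest('hex');
}

// Writes `receipt` as JSON into `dir`, using the digest as the file name.
// Identical content always maps to the same path, so reruns are idempotent.
// Note: JSON.stringify preserves key insertion order, so callers should build
// the object in a fixed order to get a stable digest.
function writeReceipt(dir, receipt) {
  const body = JSON.stringify(receipt, null, 2);
  const hash = contentAddress(body);
  const file = path.join(dir, `${hash}.json`);
  fs.mkdirSync(dir, { recursive: true });
  fs.writeFileSync(file, body);
  return { hash, file };
}
```

A caller might use it as `writeReceipt('.catalog/receipts', { tool: 'skillopt', gate: 1 })`, where both field names are placeholders.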

What's included

  • scripts/skillopt/engine.js — deterministic, dependency-free engine (optimize, reflect, applyEdits, evaluateSplit). The rollout scorer is pluggable (opts.scorer) so a real LLM adapter can replace the default heuristic without touching the loop. Runs fully offline.
  • bin/skillopt.js + cli.js skillopt subcommand — npx ... skillopt --suite <suite.json> [--skill ...] [--out best_skill.md] [--report ...] [--rounds N] [--budget N] [--no-receipt] [--json].
  • .skills/skillopt/SKILL.md — skill manifest, plus runnable example.skill.md / example.suite.json.
  • Tests: tests/unit/skillopt.test.js (engine + CLI, 15 tests) and tests/e2e/skillopt.spec.ts (Playwright CLI e2e with evidence attachments).
  • Wired into package.json bin, README "Companion tooling", CHANGELOG.md, .skills/README.md, and a .gitignore whitelist for scripts/skillopt/.
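To make the pluggable-scorer idea concrete, here is a small sketch of how an evaluation helper can defer to `opts.scorer` when one is supplied. Only the option name `opts.scorer` comes from this PR; the `(skill, task) => number` signature, the `evaluate` helper, the default heuristic, and the `minLength` field on the stub task are assumptions.

```javascript
// Assumed default heuristic: share of expected directives present in the skill
// text, compared case-insensitively. Tasks without directives count as passing.
function defaultScorer(skill, task) {
  const expected = Array.isArray(task.directives) ? task.directives : [];
  if (expected.length === 0) return 1;
  const haystack = skill.toLowerCase();
  const hits = expected.filter((d) => haystack.includes(String(d).toLowerCase()));
  return hits.length / expected.length;
}

// Hypothetical evaluation helper: the only scoring entry point is `opts.scorer`,
// so swapping the scorer never requires changes to the calling loop.
function evaluate(skill, tasks, opts = {}) {
  const scorer = typeof opts.scorer === 'function' ? opts.scorer : defaultScorer;
  const scores = tasks.map((task) => scorer(skill, task));
  const mean = scores.length ? scores.reduce((a, b) => a + b, 0) / scores.length : 0;
  return { mean, scores };
}

// Stand-in for an LLM-backed adapter. A real one would call a model; this stub
// just checks a made-up `minLength` field so the example stays offline.
const stubModelScorer = (skill, task) => (skill.length >= task.minLength ? 1 : 0);

console.log(evaluate('Always cite sources', [{ directives: ['cite sources', 'use tables'] }]));
console.log(evaluate('hello', [{ minLength: 5 }], { scorer: stubModelScorer }));
```

A scorer that calls a remote model would most likely be asynchronous, which this synchronous sketch does not cover.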

Design notes

  • The skill markdown is the only thing edited; the "model" is never fine-tuned.
  • A regression or a tie can never become best: acceptance requires a strict improvement (candidateGate > bestGate). When no holdout tasks exist, the gate falls back to the train split and reports usedHoldout: false.
  • Hardened against malformed suite JSON (null/non-object tasks, non-array directives) at the system boundary; the CLI exits cleanly with code 2 on bad input. Text used to build the regex for replace edits is escaped, so it is matched literally (no ReDoS / injection).
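The escaping point can be illustrated with the widely used escape pattern for JavaScript regular expressions. This is a sketch of the technique, not the PR's code: `escapeRegExp` and `applyReplace` are hypothetical names.

```javascript
// Escapes every character that has special meaning in a JavaScript RegExp,
// so the resulting pattern matches the input text literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical replace edit: swap every literal occurrence of `from` with `to`.
function applyReplace(skill, from, to) {
  const pattern = new RegExp(escapeRegExp(from), 'g');
  // A function replacer keeps sequences such as `$&` or `$1` in `to` from
  // being interpreted as replacement patterns.
  return skill.replace(pattern, () => to);
}

// Without escaping, '(a+)+' would compile to a nested-quantifier pattern.
console.log(applyReplace('Use (a+)+ here', '(a+)+', 'x'));
```

Escaping covers the search side; the function replacer covers the replacement side, which is a separate injection surface in `String.prototype.replace`.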

Test plan

  • npm test — 57 pass / 5 pre-existing skips, 0 fail
  • npm run lint — 0 errors
  • npx playwright test --project=chromium — 10 pass / 1 pre-existing skip (incl. new skillopt.spec.ts)
  • Manual run on the example suite: gate score 0.5 -> 1, EXIT_SIGNAL: true, best_skill.md gains the missing directives
  • Reviewer confirms the best_skill.md diff workflow reads well

https://claude.ai/code/session_01MuMv2kN3x5s6UwjXMap2ZP


Generated by Claude Code

Add the SkillOpt loop (Rollout -> Reflect -> Edit -> Gate) from
https://microsoft.github.io/SkillOpt/ as a tool for this skill
ecosystem. The skill markdown is the only trainable artifact; the
target model stays frozen and edits are accepted only when they
improve a held-out task split, with a rejected-edit buffer providing
negative feedback and an edit budget acting as the textual learning
rate.

- scripts/skillopt/engine.js: deterministic, dependency-free engine
  (optimize, reflect, applyEdits, evaluateSplit) with a pluggable
  scorer so real LLM adapters can replace the default heuristic.
- bin/skillopt.js + cli.js subcommand: optimize a SKILL.md against a
  task suite, emit best_skill.md, an optional report, and a
  content-addressed receipt under .catalog/receipts/.
- .skills/skillopt: skill manifest plus runnable example fixtures.
- Unit tests (engine + CLI) and a Playwright CLI e2e with evidence.

https://claude.ai/code/session_01MuMv2kN3x5s6UwjXMap2ZP