skill: replace before-and-after with persona-driven version #344
Open
akanksha276 wants to merge 30 commits into
…ter skill
anthropic/claude-haiku-* requires a direct Anthropic API key, which isn't configured in this environment; agents auth through litellm.
Fixes the "No API key found for provider anthropic" error on agent bootstrap.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…nd full (after) SOUL.md
Before case agents were contaminating the control by spontaneously using
`mycelium session join` — their SOUL.md included the strategy/negotiate
block which describes the CLI protocol.
Phase 0.7 now produces two files:
- exp_personas_before.json — preference parts only (keys not in {negotiate, general})
- exp_personas_after.json — full persona (preference + strategy)
Phase 1b writes preference-only SOUL.md for the before case.
Phase 3a rewrites SOUL.md with the full persona before the after case runs.
All other references (agent list, openclaw config, prompt derivation, cleanup)
updated to use the appropriate file.
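A minimal sketch of the Phase 0.7 split described above, assuming each persona is a flat dict whose strategy parts live under the keys `negotiate` and `general`. The function name `split_persona` and the sample persona are hypothetical; only the two key names and the before/after distinction come from the commit message.

```python
from typing import Dict, Tuple

# Keys treated as strategy parts, per the commit message.
STRATEGY_KEYS = {"negotiate", "general"}


def split_persona(persona: Dict[str, str]) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Return (before, after): preference-only persona and full persona."""
    before = {k: v for k, v in persona.items() if k not in STRATEGY_KEYS}
    after = dict(persona)  # full persona: preference + strategy
    return before, after


# Hypothetical persona used only for illustration.
persona = {
    "budget": "prefers low-cost options",
    "negotiate": "strategy text that describes the CLI protocol",
    "general": "general strategy text",
}
before, after = split_persona(persona)
```

Written out as JSON, `before` would correspond to an entry in exp_personas_before.json and `after` to an entry in exp_personas_after.json, so the before-case SOUL.md never sees the strategy block.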
…case seed
Agents were prematurely declaring 'CONSENSUS LOCKED' in chat before CognitiveEngine confirmed anything, which confused other agents about the actual negotiation state.
Phase 3b seed now instructs agents to:
- Never self-declare consensus; only CE's 'consensus' message is authoritative
- On CE timeout/broken: post a final message with their last accepted position so the transcript has a readable end state for evaluation
Also caps per-turn chat narration to 1-2 sentences.
…uation
- Summary table now tracks messages exchanged (both cases) and engine ticks/rounds (after case only) instead of the rounds-centric metric that had no equivalent in the before case
- Add message-counting script before report generation so values are auto-derived from transcripts rather than hand-filled
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Phase 5 now copies before/after transcripts, session transcripts, and
ingest logs into ~/.mycelium/rooms/${EXP_ID}/ before deleting room dirs.
Phase 4b gist staging reads from the eval dir first, falling back to
live room dirs.
Required for summarize_experiments.py to compute issue recall/F1 scores.
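The "eval dir first, falling back to live room dirs" lookup can be sketched as a first-existing-file search. This is an illustration only: the helper name `first_existing` and the directory names `eval` and `room` are hypothetical, and the demo uses a temporary directory rather than the real `~/.mycelium/rooms/` layout.

```python
import tempfile
from pathlib import Path
from typing import Iterable, Optional


def first_existing(candidates: Iterable[Path]) -> Optional[Path]:
    """Return the first candidate that exists as a file, else None."""
    for path in candidates:
        if path.is_file():
            return path
    return None


with tempfile.TemporaryDirectory() as tmp:
    eval_dir = Path(tmp) / "eval"   # stands in for the archived eval dir
    room_dir = Path(tmp) / "room"   # stands in for a live room dir
    eval_dir.mkdir()
    room_dir.mkdir()
    candidates = [eval_dir / "before-transcript.md", room_dir / "before-transcript.md"]

    # Only the live copy exists: the lookup falls back to the room dir.
    (room_dir / "before-transcript.md").write_text("live copy")
    picked_live = first_existing(candidates).parent.name

    # Once the archived copy exists, it takes priority.
    (eval_dir / "before-transcript.md").write_text("archived copy")
    picked_eval = first_existing(candidates).parent.name
```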
…gest events, fuzzy match
- Add Before/After Negotiation Moves columns (parsed from evaluation.md Summary table)
- Remove Before/After Rounds columns
- Parse ingest events directly from *-ingest-stats.json gist files
- Fix after-case issue recall (0%) with containment-based fuzzy matching + lower Jaccard threshold (0.5→0.3)
Signed-off-by: akanksha276 <akanksha276@gmail.com>
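To illustrate the last bullet, here is a hedged sketch of containment-based fuzzy matching combined with a Jaccard threshold. The tokenisation, the function names, and the sample labels are assumptions made for the example; only the two techniques and the 0.5→0.3 threshold change come from the commit message.

```python
import re
from typing import Set


def tokens(label: str) -> Set[str]:
    """Lowercase alphanumeric word tokens of a label."""
    return set(re.findall(r"[a-z0-9]+", label.lower()))


def jaccard(a: str, b: str) -> float:
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def contains(a: str, b: str) -> bool:
    """True when one label's tokens are a subset of the other's."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return False
    return ta <= tb or tb <= ta


def fuzzy_match(found: str, gold: str, threshold: float = 0.3) -> bool:
    return contains(found, gold) or jaccard(found, gold) >= threshold
```

With these definitions, "budget cap" and "budget limit" share one token out of three (Jaccard ≈ 0.33), so the pair matches at 0.3 but not at the old 0.5 threshold, while "sector exposure limits" matches "sector exposure" through containment alone.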
Before moves = non-facilitator chat messages in before-transcript.md.
After moves = direct session actions in after-session-transcript.md.
Removes dependency on evaluation.md table being manually populated.
Signed-off-by: akanksha276 <akanksha276@gmail.com>
Signed-off-by: akanksha276 <akanksha276@gmail.com>
…table Signed-off-by: akanksha276 <akanksha276@gmail.com>
Signed-off-by: akanksha276 <akanksha276@gmail.com>
Signed-off-by: akanksha276 <akanksha276@gmail.com>
Signed-off-by: akanksha276 <akanksha276@gmail.com>
- Replace per-experiment verdict concatenation with LLM synthesis via LiteLLM proxy (haiku by default, SUMMARY_MODEL env override)
- Falls back to concatenation if LLM call fails
- Remove Aggregate Statistics table from report
- Strip markdown headers from LLM output
Signed-off-by: akanksha276 <akanksha276@gmail.com>
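The synthesis-with-fallback behaviour can be sketched as follows. This is not the script's actual code: the LLM call is passed in as a hypothetical `call_llm` callable so the example makes no claims about the proxy's API, `DEFAULT_MODEL` is a placeholder string, and "strip markdown headers" is assumed to mean removing the leading `#` markers while keeping the heading text.

```python
import os
import re
from typing import Callable, List

DEFAULT_MODEL = "haiku"  # placeholder; the commit only says "haiku by default"


def strip_markdown_headers(text: str) -> str:
    """Remove leading '#' markers from heading lines, keeping their text."""
    return re.sub(r"^#{1,6}\s+", "", text, flags=re.MULTILINE)


def synthesize_verdict(
    verdicts: List[str],
    call_llm: Callable[[str, List[str]], str],
) -> str:
    """Ask the LLM for one verdict; fall back to plain concatenation."""
    model = os.environ.get("SUMMARY_MODEL", DEFAULT_MODEL)
    try:
        return strip_markdown_headers(call_llm(model, verdicts)).strip()
    except Exception:
        # Any failure of the LLM call degrades to the old behaviour.
        return "\n\n".join(verdicts)


def failing_llm(model: str, verdicts: List[str]) -> str:
    raise RuntimeError("proxy unreachable")


def working_llm(model: str, verdicts: List[str]) -> str:
    return "## Verdict\nAfter case converged faster."


fallback = synthesize_verdict(["ex01: ok", "ex02: ok"], failing_llm)
synthesized = synthesize_verdict(["ex01: ok", "ex02: ok"], working_llm)
```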
- Add inline persona option (Phase 0.55) alongside dataset path
- Fix all backend curl paths to use /api/ prefix
- Fix plugin path: adapters/ → integrations/openclaw/assets/
- Upgrade Phase 0.8 to check configured openclaw model first
- Update agent-personas repo to mycelium-io org
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The Est. input tokens regex was anchored with `\s*` after the label, causing it to miss rows like:
| Est. input tokens (exp-2318 total) | ...
| Est. input tokens (total buffer) | ...
Relaxed to `[^|]*` so any text between the label and the first pipe is consumed, fixing blank #Tokens for ex03 and ex04.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
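A small reproduction of the difference, under the assumption that the pattern captures the first cell after the label. The two patterns below are reconstructions for illustration, not the script's exact regexes, and the numeric cell values are made up because the originals are elided in the commit message.

```python
import re

# Strict: only whitespace allowed between the label and the next pipe.
STRICT = re.compile(r"\|\s*Est\. input tokens\s*\|\s*([^|]+?)\s*\|")
# Relaxed: any non-pipe text (such as a parenthesised suffix) is consumed.
RELAXED = re.compile(r"\|\s*Est\. input tokens[^|]*\|\s*([^|]+?)\s*\|")

rows = [
    "| Est. input tokens | 1200 |",
    "| Est. input tokens (exp-2318 total) | 3400 |",
    "| Est. input tokens (total buffer) | 5600 |",
]

strict_hits = [m.group(1) for row in rows if (m := STRICT.search(row))]
relaxed_hits = [m.group(1) for row in rows if (m := RELAXED.search(row))]
```

The strict pattern only matches the unsuffixed row, which is consistent with the blank #Tokens cells described above; the relaxed pattern matches all three.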
…only Signed-off-by: akanksha276 <akanksha276@gmail.com>
…atching
Add _extract_issues_from_ce_consensus() to parse CognitiveEngine consensus/plan
lines from the transcript formats observed across experiments:
- [coordination_consensus] CognitiveEngine: {JSON} (inline, ex07 style)
- [coordination_consensus] CognitiveEngine:\n{JSON} (next-line, ex01 style)
- [CognitiveEngine] {JSON} (inline prefix, ex09 style)
- **CognitiveEngine:**\n{JSON} (markdown, exp-4919 style)
Extraction uses three strategies in priority order: full JSON parse of
"assignments" dict, regex over truncated "assignments" blocks, and semicolon-
delimited "plan": "key=value" string parsing.
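The three strategies can be sketched as a priority chain over the JSON payload that follows the CognitiveEngine prefix. This is a simplified illustration, assuming a flat `assignments` dict and a `key=value; key=value` plan string; the function names and sample payloads are hypothetical and it does not reproduce `_extract_issues_from_ce_consensus()` itself.

```python
import json
import re
from typing import List


def plan_keys(plan: str) -> List[str]:
    """Keys of a semicolon-delimited 'key=value' plan string."""
    return [part.split("=", 1)[0].strip() for part in plan.split(";") if "=" in part]


def extract_issues(payload: str) -> List[str]:
    # Strategy 1: full JSON parse of the "assignments" dict.
    try:
        data = json.loads(payload)
    except json.JSONDecodeError:
        data = None
    if isinstance(data, dict) and isinstance(data.get("assignments"), dict):
        if data["assignments"]:
            return list(data["assignments"])

    # Strategy 2: regex over a (possibly truncated) "assignments" block.
    block = re.search(r'"assignments"\s*:\s*\{([^}]*)', payload)
    if block:
        keys = re.findall(r'"([^"]+)"\s*:', block.group(1))
        if keys:
            return keys

    # Strategy 3: semicolon-delimited "plan" string.
    plan = re.search(r'"plan"\s*:\s*"([^"]*)', payload)
    if plan:
        return plan_keys(plan.group(1))
    return []


full = extract_issues('{"assignments": {"budget": "under 500", "timeline": "Q3"}}')
truncated = extract_issues('{"assignments": {"budget": "under 500", "timeline": "Q3')
from_plan = extract_issues('{"plan": "budget=under 500; timeline=Q3"}')
```

Ordering the strategies from strictest to loosest means a well-formed payload never goes through the regex paths, which are only there to salvage truncated or plan-only consensus lines.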
Add _EmbedMatcher using fastembed BAAI/bge-small-en-v1.5 (threshold 0.70) to
replace Jaccard-only fuzzy matching. Auto-detects the Mycelium backend venv;
falls back to Jaccard when fastembed is unavailable. Eliminates 0% after-recall
caused by CE using agent-generated labels rather than gold standard labels.
After-recall improvement across 9 experiments: ex04 8%→108%, ex05 0%→50%,
ex06 0%→55%, ex09 38%→100%. Trim candidates reduced from 29 to 17.
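A hedged sketch of the matcher-with-fallback idea: cosine similarity over embeddings when an embedding function is available, Jaccard otherwise. The class name `LabelMatcher` and the injected `embed` callable are hypothetical stand-ins, so the example does not depend on fastembed being installed; in real use `embed` would wrap the embedding model named in the commit. The toy vectors are hand-made and do not reflect real model output.

```python
import math
from typing import Callable, Optional, Sequence

Embed = Callable[[str], Sequence[float]]


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def jaccard(a: str, b: str) -> float:
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta and tb else 0.0


class LabelMatcher:
    """Embedding cosine similarity when available, else Jaccard."""

    def __init__(
        self,
        embed: Optional[Embed] = None,
        embed_threshold: float = 0.70,
        jaccard_threshold: float = 0.3,
    ) -> None:
        self.embed = embed
        self.embed_threshold = embed_threshold
        self.jaccard_threshold = jaccard_threshold

    def score(self, a: str, b: str) -> float:
        if self.embed is not None:
            return cosine(self.embed(a), self.embed(b))
        return jaccard(a, b)

    def matches(self, a: str, b: str) -> bool:
        threshold = self.embed_threshold if self.embed is not None else self.jaccard_threshold
        return self.score(a, b) >= threshold


# Toy 2-d "embeddings" for two labels with no words in common.
TOY_VECTORS = {"spending ceiling": (1.0, 0.0), "maximum budget": (0.8, 0.6)}

with_embeddings = LabelMatcher(embed=TOY_VECTORS.__getitem__)
jaccard_only = LabelMatcher()
```

Here the two labels share no tokens, so the Jaccard-only matcher rejects the pair, while the embedding-backed matcher accepts it because the toy cosine similarity (0.8) clears the 0.70 threshold. That is the failure mode the commit describes for agent-generated labels versus gold labels.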
Replace Jaccard 0.4 threshold with embedding cosine similarity (0.65) when comparing found option values against gold option strings. CE negotiated values are paraphrases of gold options, not near-copies — Jaccard was producing 0% after-option-recall across all experiments. After-option-recall improvement: ex03 0%→58%, ex04 0%→81%, ex05 0%→64%, ex06 0%→39%, ex09 0%→71%.
Add _extract_options_from_ce_consensus() to extract {issue: resolved_value}
pairs from CE consensus/plan lines (assignments dict and plan key=value string).
These are added to the options dict so the value matcher can compare the CE's
single agreed value against gold option descriptions. Fixes 0% after-option-
recall for ex01/ex07/ex08 where no offer-tick payloads were present.
Also upgrade compute_option_metrics() key routing to use embedding similarity
(threshold 0.70) in addition to Jaccard/word-overlap. CE plan keys like
"broad diversification required" don't word-overlap with gold issue names like
"sector exposure", but embedding similarity is 0.69+.
Remaining 0% after-option-recall for ex07/ex08 reflects genuine CE coverage
gaps (plan only resolved 2-3 issues vs 10-11 gold), not a matching failure.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: akanksha276 <akanksha276@gmail.com>
Summary
Replaces `.claude/skills/before-and-after/` with `.claude/skills/persona-before-and-after/`, a superset that supports three persona input modes:
- … `before-and-after` skill)
Key additions over the old skill:
- `summarize_experiments.py` aggregates results across multiple experiment gists, scores issue/option recall/F1 against `all_missions_set1_gold.json`, and synthesises a verdict via LLM
Test plan
- … `persona-before-and-after list` to verify persona dataset clone works
- … `ex03_personal_planning`: fastest, ~4 rounds)
- … `summarize_experiments.py` against the 9 published gists and verify output matches the summary gist
- … `before-and-after` skill