skill: replace before-and-after with persona-driven version #344

Open
akanksha276 wants to merge 30 commits into main from persona-before-and-after

@akanksha276

Contributor

Summary

Replaces .claude/skills/before-and-after/ with .claude/skills/persona-before-and-after/, a superset that supports three persona input modes:

  1. Dataset personas (recommended) — agents built from versioned preference + strategy files in the agent-personas repo; reproducible across runs
  2. Inline personas — describe agents yourself in the prompt; the skill writes SOUL.md from your description (same flexibility as the old before-and-after skill)
  3. Custom SOUL.md — bring your own files, same as before

Key additions over the old skill:

  • Before-case uses preference-only personas (no strategy injection) as a clean control; after-case injects the full negotiation strategy — isolating the Mycelium protocol contribution
  • summarize_experiments.py aggregates results across multiple experiment gists, scores issue/option recall/F1 against all_missions_set1_gold.json, and synthesises a verdict via LLM
  • Validated across 9 experiments (ex01–ex09); results at https://gist.github.com/akanksha276/b7f61c246891e0e335666f034849b90d
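As a rough illustration of the recall/F1 scoring mentioned above, the following sketch scores a set of found issue labels against a gold set using exact matching. The function name and the sample labels are hypothetical, and the actual scoring in summarize_experiments.py may differ (it uses fuzzy matching, per the commits below).

```python
def precision_recall_f1(found: set[str], gold: set[str]) -> tuple[float, float, float]:
    """Score found labels against a gold set using exact set overlap."""
    true_positives = len(found & gold)
    precision = true_positives / len(found) if found else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical issue labels, for illustration only.
gold_issues = {"budget", "timeline", "venue", "catering"}
found_issues = {"budget", "venue", "parking"}
p, r, f1 = precision_recall_f1(found_issues, gold_issues)
```

Here two of three found labels are correct and two of four gold labels are recovered, so recall is lower than precision.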

Test plan

  • Run persona-before-and-after list to verify persona dataset clone works
  • Run one experiment end-to-end with dataset personas (e.g. ex03_personal_planning — fastest, ~4 rounds)
  • Run one experiment with inline personas to verify the old workflow still works
  • Run summarize_experiments.py against the 9 published gists and verify output matches the summary gist
  • Confirm no other skill or doc references the removed before-and-after skill

akanksha276 and others added 30 commits May 5, 2026 09:34
…ter skill

anthropic/claude-haiku-* requires a direct Anthropic API key which isn't
configured in this environment — agents auth through litellm. Fixes the
"No API key found for provider anthropic" error on agent bootstrap.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…nd full (after) SOUL.md

Before case agents were contaminating the control by spontaneously using
`mycelium session join` — their SOUL.md included the strategy/negotiate
block which describes the CLI protocol.

Phase 0.7 now produces two files:
- exp_personas_before.json — preference parts only (keys not in {negotiate, general})
- exp_personas_after.json  — full persona (preference + strategy)

Phase 1b writes preference-only SOUL.md for the before case.
Phase 3a rewrites SOUL.md with the full persona before the after case runs.

All other references (agent list, openclaw config, prompt derivation, cleanup)
updated to use the appropriate file.
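
The before/after split described in this commit could look roughly like the sketch below. It assumes each persona is a flat dict whose strategy parts live under the top-level keys negotiate and general, as the commit message suggests; the function name and sample persona are hypothetical.

```python
# Keys treated as strategy, per the commit message ("keys not in {negotiate, general}").
STRATEGY_KEYS = {"negotiate", "general"}


def split_persona(persona: dict) -> tuple[dict, dict]:
    """Return (before, after): preference-only parts and the full persona."""
    before = {k: v for k, v in persona.items() if k not in STRATEGY_KEYS}
    after = dict(persona)  # full persona: preference + strategy
    return before, after


# Hypothetical persona, for illustration only.
persona = {
    "dietary": "vegetarian",
    "budget": "under $50",
    "negotiate": "concede on budget before dietary needs",
    "general": "keep messages short",
}
before, after = split_persona(persona)
```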
…case seed

Agents were prematurely declaring 'CONSENSUS LOCKED' in chat before
CognitiveEngine confirmed anything, which confused other agents about
the actual negotiation state.

Phase 3b seed now instructs agents to:
- Never self-declare consensus — only CE's 'consensus' message is authoritative
- If CE times out or breaks: post a final message with their last accepted
  position so the transcript has a readable end state for evaluation

Also caps per-turn chat narration to 1-2 sentences.
…uation

- Summary table now tracks messages exchanged (both cases) and engine
  ticks/rounds (after case only) instead of the rounds-centric metric
  that had no equivalent in the before case
- Add message-counting script before report generation so values are
  auto-derived from transcripts rather than hand-filled

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Phase 5 now copies before/after transcripts, session transcripts, and
ingest logs into ~/.mycelium/rooms/${EXP_ID}/ before deleting room dirs.
Phase 4b gist staging reads from the eval dir first, falling back to
live room dirs.

Required for summarize_experiments.py to compute issue recall/F1 scores.
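
A minimal sketch of the "eval dir first, falling back to live room dirs" lookup, assuming artifacts are addressed by file name. The helper name is hypothetical, and temporary directories stand in for the real locations.

```python
import tempfile
from pathlib import Path
from typing import Optional, Sequence


def resolve_artifact(name: str, eval_dir: Path, room_dirs: Sequence[Path]) -> Optional[Path]:
    """Prefer the archived copy in the eval dir; fall back to live room dirs."""
    for base in (eval_dir, *room_dirs):
        candidate = base / name
        if candidate.is_file():
            return candidate
    return None


# Demo with throwaway directories standing in for the real locations.
with tempfile.TemporaryDirectory() as tmp:
    eval_dir = Path(tmp) / "eval"
    room_dir = Path(tmp) / "room"
    eval_dir.mkdir()
    room_dir.mkdir()
    (room_dir / "before-transcript.md").write_text("live copy")
    from_room = resolve_artifact("before-transcript.md", eval_dir, [room_dir]).read_text()
    (eval_dir / "before-transcript.md").write_text("archived copy")
    from_eval = resolve_artifact("before-transcript.md", eval_dir, [room_dir]).read_text()
    missing = resolve_artifact("after-transcript.md", eval_dir, [room_dir])
```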
…gest events, fuzzy match

- Add Before/After Negotiation Moves columns (parsed from evaluation.md Summary table)
- Remove Before/After Rounds columns
- Parse ingest events directly from *-ingest-stats.json gist files
- Fix after-case issue recall (0%) with containment-based fuzzy matching + lower Jaccard threshold (0.5→0.3)
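
One way to read "containment-based fuzzy matching + lower Jaccard threshold" is sketched below: two labels match if the shorter label's tokens are all contained in the longer one, or if token-level Jaccard similarity reaches the threshold. The tokenisation and function names are assumptions, not the script's actual code.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercased alphanumeric tokens of a label."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: str, b: str) -> float:
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def contains(a: str, b: str) -> bool:
    """True when every token of the shorter label appears in the longer one."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return False
    small, large = (ta, tb) if len(ta) <= len(tb) else (tb, ta)
    return small <= large


def fuzzy_match(found: str, gold: str, threshold: float = 0.3) -> bool:
    return contains(found, gold) or jaccard(found, gold) >= threshold
```

With these definitions, "venue" matches "venue selection and booking" through containment even though its Jaccard score is only 0.25, and "budget cap per person" matches "budget limit per night" at threshold 0.3 but not at 0.5.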

Signed-off-by: akanksha276 <akanksha276@gmail.com>
Before moves = non-facilitator chat messages in before-transcript.md.
After moves = direct session actions in after-session-transcript.md.
Removes dependency on evaluation.md table being manually populated.
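
Counting before-case moves might be sketched as follows. This assumes each chat message in before-transcript.md starts a line with a bold `**sender:**` prefix and that the facilitator posts under the name "facilitator"; both are assumptions about the transcript format, and the names below are hypothetical.

```python
import re

# Assumed message prefix: a line starting with "**sender:**".
SENDER_RE = re.compile(r"^\*\*(?P<sender>[^*:]+):\*\*", re.MULTILINE)


def count_moves(transcript: str, excluded: set[str]) -> int:
    """Count messages whose sender is not in the excluded set (case-insensitive)."""
    senders = (m.group("sender").strip().lower() for m in SENDER_RE.finditer(transcript))
    return sum(1 for sender in senders if sender not in excluded)


# Hypothetical transcript excerpt, for illustration only.
sample = """\
**facilitator:** Please agree on a plan.
**alice:** I prefer Saturday.
**bob:** Sunday works better for me.
**alice:** Saturday afternoon, then?
"""
before_moves = count_moves(sample, excluded={"facilitator"})
```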

Signed-off-by: akanksha276 <akanksha276@gmail.com>
Signed-off-by: akanksha276 <akanksha276@gmail.com>
…table

Signed-off-by: akanksha276 <akanksha276@gmail.com>
Signed-off-by: akanksha276 <akanksha276@gmail.com>
Signed-off-by: akanksha276 <akanksha276@gmail.com>
Signed-off-by: akanksha276 <akanksha276@gmail.com>
- Replace per-experiment verdict concatenation with LLM synthesis via
  LiteLLM proxy (haiku by default, SUMMARY_MODEL env override)
- Falls back to concatenation if LLM call fails
- Remove Aggregate Statistics table from report
- Strip markdown headers from LLM output
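
The synthesis-with-fallback behaviour could be sketched like this. The LLM call is injected as a plain callable so the sketch stays independent of any particular client; the prompt wording, the function names, and the choice to strip only the leading # marks (rather than whole header lines) are assumptions.

```python
import re
from typing import Callable


def strip_markdown_headers(text: str) -> str:
    """Remove leading '#' marks from header lines, keeping the header text."""
    return re.sub(r"^[ \t]{0,3}#{1,6}[ \t]+", "", text, flags=re.MULTILINE)


def synthesise_verdict(verdicts: list[str], llm: Callable[[str], str]) -> str:
    """Ask the LLM for one verdict; fall back to concatenation if the call fails."""
    prompt = (
        "Synthesise one overall verdict from these experiment verdicts:\n\n"
        + "\n\n".join(verdicts)
    )
    try:
        return strip_markdown_headers(llm(prompt)).strip()
    except Exception:
        # Fallback: plain concatenation of the per-experiment verdicts.
        return "\n\n".join(verdicts)
```

In practice `llm` would wrap a request to the LiteLLM proxy using the model named by SUMMARY_MODEL; any callable that takes a prompt and returns text fits this sketch.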

Signed-off-by: akanksha276 <akanksha276@gmail.com>
- Add inline persona option (Phase 0.55) alongside dataset path
- Fix all backend curl paths to use /api/ prefix
- Fix plugin path: adapters/ → integrations/openclaw/assets/
- Upgrade Phase 0.8 to check configured openclaw model first
- Update agent-personas repo to mycelium-io org

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The Est. input tokens regex was anchored with \s* after the label,
causing it to miss rows like:
  | Est. input tokens (exp-2318 total) | ...
  | Est. input tokens (total buffer) | ...

Relaxed to [^|]* so any text between the label and the first pipe
is consumed, fixing blank #Tokens for ex03 and ex04.
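
The difference between the two patterns can be reproduced with a small sketch. The exact regexes in the script are not shown in this PR, so the patterns below are reconstructions from the description, and the token values in the sample rows are made up.

```python
import re

# Reconstructed from the commit description; the script's exact patterns may differ.
STRICT = re.compile(r"Est\. input tokens\s*\|\s*([\d,]+)")
RELAXED = re.compile(r"Est\. input tokens[^|]*\|\s*([\d,]+)")

# Row labels follow the commit message; the values are hypothetical.
rows = [
    "| Est. input tokens | 12,000 |",
    "| Est. input tokens (exp-2318 total) | 34,500 |",
    "| Est. input tokens (total buffer) | 8,250 |",
]

strict_hits = [bool(STRICT.search(row)) for row in rows]
relaxed_values = [RELAXED.search(row).group(1) for row in rows]
```

The strict pattern only matches the first row because `\s*` cannot consume the parenthesised qualifier, while `[^|]*` skips everything up to the next pipe.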

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…only

Signed-off-by: akanksha276 <akanksha276@gmail.com>
…atching

Add _extract_issues_from_ce_consensus() to parse CognitiveEngine consensus/plan
lines from all four transcript formats observed across experiments:
  - [coordination_consensus] CognitiveEngine: {JSON} (inline, ex07 style)
  - [coordination_consensus] CognitiveEngine:\n{JSON} (next-line, ex01 style)
  - [CognitiveEngine] {JSON} (inline prefix, ex09 style)
  - **CognitiveEngine:**\n{JSON} (markdown, exp-4919 style)

Extraction uses three strategies in priority order: full JSON parse of
"assignments" dict, regex over truncated "assignments" blocks, and semicolon-
delimited "plan": "key=value" string parsing.
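
The three extraction strategies could be sketched as below, operating on the JSON payload after the CognitiveEngine prefix has been stripped. This is an illustration of the priority order only: the function names are hypothetical, and the payload shapes are assumed from the "assignments" and "plan" keys named above.

```python
import json
import re


def _keys_from_plan(plan: str) -> list[str]:
    """Parse a semicolon-delimited 'key=value; key=value' plan string."""
    return [part.split("=", 1)[0].strip() for part in plan.split(";") if "=" in part]


def extract_issues(payload: str) -> list[str]:
    # Strategy 1: full JSON parse of the "assignments" dict.
    try:
        data = json.loads(payload)
    except json.JSONDecodeError:
        data = None
    if isinstance(data, dict):
        assignments = data.get("assignments")
        if isinstance(assignments, dict) and assignments:
            return list(assignments)
    # Strategy 2: regex over a (possibly truncated) "assignments" block.
    block = re.search(r'"assignments"\s*:\s*\{(.*)', payload, re.DOTALL)
    if block:
        keys = re.findall(r'"([^"]+)"\s*:', block.group(1))
        if keys:
            return keys
    # Strategy 3: semicolon-delimited "plan" string.
    plan = re.search(r'"plan"\s*:\s*"([^"]*)', payload)
    if plan:
        return _keys_from_plan(plan.group(1))
    return []
```

The regex in strategy 2 is deliberately simple and assumes flat string values; nested objects or escaped quotes inside values would need a more careful parser.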

Add _EmbedMatcher using fastembed BAAI/bge-small-en-v1.5 (threshold 0.70) to
replace Jaccard-only fuzzy matching. Auto-detects the Mycelium backend venv;
falls back to Jaccard when fastembed is unavailable. Eliminates 0% after-recall
caused by CE using agent-generated labels rather than gold standard labels.
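
A matcher with this fallback behaviour might be sketched as follows. The class and helper names are hypothetical (this is not the PR's _EmbedMatcher), venv auto-detection is left out, and the Jaccard fallback threshold of 0.3 is an assumption. The fastembed usage relies on its `TextEmbedding(model_name=...)` constructor and `embed()` method.

```python
import math
import re
from typing import Callable, Optional, Sequence

Embedder = Callable[[Sequence[str]], Sequence[Sequence[float]]]


def load_fastembed(model_name: str = "BAAI/bge-small-en-v1.5") -> Optional[Embedder]:
    """Return an embedder backed by fastembed, or None when it is not installed."""
    try:
        from fastembed import TextEmbedding
    except ImportError:
        return None
    model = TextEmbedding(model_name=model_name)
    return lambda texts: [vec.tolist() for vec in model.embed(list(texts))]


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def jaccard(a: str, b: str) -> float:
    ta = set(re.findall(r"[a-z0-9]+", a.lower()))
    tb = set(re.findall(r"[a-z0-9]+", b.lower()))
    return len(ta & tb) / len(ta | tb) if ta and tb else 0.0


class LabelMatcher:
    """Match labels by embedding cosine similarity, or Jaccard when no embedder is given."""

    def __init__(self, embed: Optional[Embedder] = None,
                 embed_threshold: float = 0.70, jaccard_threshold: float = 0.3):
        self.embed = embed
        self.embed_threshold = embed_threshold
        self.jaccard_threshold = jaccard_threshold  # assumed fallback threshold

    def matches(self, found: str, gold: str) -> bool:
        if self.embed is not None:
            u, v = self.embed([found, gold])
            return cosine(u, v) >= self.embed_threshold
        return jaccard(found, gold) >= self.jaccard_threshold
```

Passing `LabelMatcher(embed=load_fastembed())` selects embeddings when fastembed is importable and silently degrades to Jaccard otherwise.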

After-recall improvement across 9 experiments: ex04 8%→108%, ex05 0%→50%,
ex06 0%→55%, ex09 38%→100%. Trim candidates reduced from 29 to 17.
Replace Jaccard 0.4 threshold with embedding cosine similarity (0.65) when
comparing found option values against gold option strings. CE negotiated
values are paraphrases of gold options, not near-copies — Jaccard was
producing 0% after-option-recall across all experiments.

After-option-recall improvement: ex03 0%→58%, ex04 0%→81%, ex05 0%→64%,
ex06 0%→39%, ex09 0%→71%.
Add _extract_options_from_ce_consensus() to extract {issue: resolved_value}
pairs from CE consensus/plan lines (assignments dict and plan key=value string).
These are added to the options dict so the value matcher can compare the CE's
single agreed value against gold option descriptions. Fixes 0% after-option-
recall for ex01/ex07/ex08 where no offer-tick payloads were present.

Also upgrade compute_option_metrics() key routing to use embedding similarity
(threshold 0.70) in addition to Jaccard/word-overlap. CE plan keys like
"broad diversification required" don't word-overlap with gold issue names like
"sector exposure", but embedding similarity is 0.69+.

Remaining 0% after-option-recall for ex07/ex08 reflects genuine CE coverage
gaps (plan only resolved 2-3 issues vs 10-11 gold), not a matching failure.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: akanksha276 <akanksha276@gmail.com>