[codex] Upgrade Agent Studio workbench #3

Open
ColinLi98 wants to merge 18 commits into main from codex/agent-studio-workbench-local-launch

ColinLi98 (Owner) commented May 10, 2026

What changed

  • Upgrades Agent Studio into the interactive local workbench: startup page, reader-first Studio layout, director panel, choice cards, branch map, quality/export status, and .nosbook export support.
  • Adds ReaderShell async generation hardening for queued /v1/reader/continue responses with job polling, resume-on-stale, session reload, and visible wait state.
  • Adds Agent Studio rendered smoke coverage, desktop/mobile screenshot artifacts, sticky director checks, mobile bounded choice scroll checks, and visual review checklist output.
  • Adds scripts/run_agent_studio_local.sh, which starts the local backend and automatically opens /app?product=author&workspace=studio&debug=1 after /health passes. Set AGENT_STUDIO_OPEN_BROWSER=0 to disable browser opening.
  • Refreshes the standard cross-pack benchmark baseline: CI showed that the initial-import baseline was stale, while current benchmark quality passes at 1.000.
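
For reviewers who want a mental model of the ReaderShell hardening described above (job polling plus resume-on-stale), here is a minimal sketch in Python. It is not the code in reader_shell_v2.js. The `state` and `heartbeat` fields, the `poll_job` name, and the staleness rule (heartbeat unchanged for several polls) are all assumptions made for illustration.

```python
import time
from typing import Callable, Dict, Optional

Status = Dict[str, object]


def poll_job(
    fetch_status: Callable[[], Status],
    resume_job: Callable[[], None],
    *,
    max_polls: int = 20,
    stale_after: int = 3,
    interval_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[Status]:
    """Poll a queued job until it is done.

    Hypothetical contract: fetch_status() returns a dict with a "state"
    ("queued", "running", "done", "failed") and a "heartbeat" value that
    changes whenever the backend makes progress.
    """
    last_heartbeat = None
    unchanged = 0
    for _ in range(max_polls):
        status = fetch_status()
        if status.get("state") == "done":
            return status
        if status.get("state") == "failed":
            raise RuntimeError("generation job failed")

        heartbeat = status.get("heartbeat")
        if heartbeat == last_heartbeat:
            unchanged += 1
        else:
            last_heartbeat, unchanged = heartbeat, 0

        # Resume-on-stale: no progress for `stale_after` polls in a row.
        if unchanged >= stale_after:
            resume_job()
            unchanged = 0

        sleep(interval_s)
    # Budget exhausted; the caller would keep showing the wait state
    # or reload the session.
    return None
```

Injecting `fetch_status`, `resume_job`, and `sleep` keeps the loop testable without a backend, which is roughly the property the smoke tests depend on.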

Merge gate evidence

  • Lane: Lane B
  • Phase: Phase 2
  • Task: Task AS-4/AS-5 follow-up hardening
  • Goal met: yes
  • Out-of-scope changes introduced: no
  • Tests run: targeted Agent Studio/ReaderShell/frontend smoke contract tests, targeted AuthorWork branch/export tests, cross-pack merge gate tests, and local rendered Agent Studio smoke
  • Benchmark / eval run: standard cross-pack benchmark plus merge gate after baseline refresh
  • Strongest pack delta: current strongest remains synthetic_min_pack and urban_mystery_lotus_lane after baseline refresh
  • Weakest pack delta: current weakest remains jade_court_romance, tide_archive_memory_debt, and xianxia_forgotten_vow after baseline refresh
  • Cross-pack pass-rate delta: +0.000 against refreshed baseline; prior stale baseline showed +0.067 before refresh
  • Issue category delta (Q03/Q04/Q05/Q09 if relevant): no generation/planner changes; benchmark phase gate reports no blocking issue-category regression
  • Rollback point: revert the PR branch commits, especially the Agent Studio UI/smoke changes and refreshed tests/benchmark_baseline.json
  • Next suggested task: run long-route benchmark evidence after Agent Studio PR lands
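
To show why the pass-rate delta drops to +0.000 once the baseline is refreshed, here is a small sketch of the arithmetic. The pack names and pass/fail values are hypothetical placeholders, not the real benchmark results, and the function names are not taken from the benchmark runner.

```python
from typing import Dict


def pass_rate(results: Dict[str, bool]) -> float:
    """Fraction of packs that passed."""
    return sum(1 for ok in results.values() if ok) / len(results)


def format_delta(current: Dict[str, bool], baseline: Dict[str, bool]) -> str:
    """Signed pass-rate change, formatted like the merge gate field."""
    return f"{pass_rate(current) - pass_rate(baseline):+.3f}"


# Hypothetical data: three placeholder packs, all passing today.
current = {"pack_a": True, "pack_b": True, "pack_c": True}
# A stale baseline recorded before one pack started passing.
stale_baseline = {"pack_a": True, "pack_b": True, "pack_c": False}

print(format_delta(current, stale_baseline))  # positive delta vs stale baseline
print(format_delta(current, current))         # zero delta vs refreshed baseline
```

A positive delta against a stale baseline says more about the baseline's age than about this PR, so the delta against the refreshed baseline is the one to read.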

Product impact

  • Does this move commercialization forward?: yes, it makes Author-side local creation usable and CI-verifiable.
  • Does this improve kernel/product/ops instead of just current-pack polish?: yes, changes are product shell, smoke reliability, PR review process, and benchmark evidence hygiene.
  • Does this make weakest packs easier to diagnose or improve?: yes, cross-pack benchmark and visual smoke artifacts now surface current weakest/strongest packs and UI regressions clearly.

Validation

  • bash -n scripts/run_agent_studio_local.sh scripts/run_agent_studio_smoke.sh scripts/run_frontend_shell_smoke.sh scripts/run_reader_shell_smoke.sh
  • node -c src/narrativeos/web/agent_studio.js && node -c src/narrativeos/web/reader_shell_v2.js && node -c scripts/verify_agent_studio_smoke.js && node -c scripts/verify_frontend_shell_smoke.js
  • .venv/bin/python -m py_compile scripts/write_agent_studio_smoke_step_summary.py scripts/write_frontend_shell_smoke_step_summary.py
  • .venv/bin/python -m pytest tests/test_agent_studio_interactive_workbench.py -q
  • .venv/bin/python -m pytest tests/test_frontend_shell_smoke_ci.py::test_frontend_shell_smoke_scripts_exist_and_are_parseable tests/test_frontend_shell_smoke_ci.py::test_agent_studio_smoke_workflow_wires_headless_runner_and_artifacts tests/test_frontend_shell_docs.py -q
  • .venv/bin/python -m pytest tests/test_reader_shell_v2.py -q
  • .venv/bin/python -m pytest tests/test_reader_shell_flow.py -q
  • .venv/bin/python -m pytest tests/test_author_works.py::test_author_work_flow_supports_generate_edit_diagnostics_and_submit tests/test_author_works.py::test_author_work_can_create_parallel_universe_branch_without_overwriting_mainline tests/test_author_works.py::test_author_work_branch_discards_mainline_future_chapters_after_selected_fork_point -q
  • /tmp/narrativeos-py312-venv/bin/python -m pytest tests/test_cross_pack_merge_gate.py tests/test_cross_pack_benchmark.py::test_cross_pack_benchmark_outputs_kernel_metrics tests/test_phase0_guardrails.py -q
  • /tmp/narrativeos-py312-venv/bin/python -m src.narrativeos.benchmark.runner --baseline-file tests/benchmark_baseline.json --database-url sqlite:///narrativeos_beta.db --markdown-out /tmp/pr3-benchmark-summary-updated.md > /tmp/pr3-benchmark-updated.json
  • /tmp/narrativeos-py312-venv/bin/python -m src.narrativeos.benchmark.merge_gate --benchmark-file /tmp/pr3-benchmark-updated.json --summary-out /tmp/pr3-merge-gate-summary-updated.md
  • CI_HEADLESS=1 APP_PORT=8018 CHROME_PORT=9238 CHROME_USER_DIR=/tmp/narrativeos-chrome-agent-studio scripts/run_agent_studio_smoke.sh
  • APP_PORT=8766 AGENT_STUDIO_OPEN_BROWSER=0 bash scripts/run_agent_studio_local.sh (run against a temporary /health stub to confirm that the script emits the Studio URL)
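
The last validation step relies on a temporary /health stub. A rough Python sketch of such a stub, with the wait-then-emit-URL flow it exercises, might look like the following. The JSON body, the helper names, and the retry settings are assumptions; only the /health path and the Studio URL path come from this PR description.

```python
import threading
import time
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Path quoted in the PR description.
STUDIO_PATH = "/app?product=author&workspace=studio&debug=1"


class HealthStub(BaseHTTPRequestHandler):
    """Answers 200 on /health and 404 elsewhere (body is an assumption)."""

    def do_GET(self):
        if self.path == "/health":
            body = b'{"status": "ok"}'
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, format, *args):
        pass  # keep test output quiet


def start_health_stub(port: int = 0) -> HTTPServer:
    """Start the stub on a background thread; port 0 picks a free port."""
    server = HTTPServer(("127.0.0.1", port), HealthStub)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def wait_for_health(base_url: str, attempts: int = 20, delay_s: float = 0.1) -> bool:
    """Return True once GET /health answers 200, False after all attempts."""
    # Bypass any proxy settings so the loopback request stays local.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    for _ in range(attempts):
        try:
            with opener.open(base_url + "/health", timeout=1) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass
        time.sleep(delay_s)
    return False


def studio_url(base_url: str) -> str:
    """URL the launcher would emit (and open) once the backend is healthy."""
    return base_url + STUDIO_PATH
```

Usage would be along the lines of starting the stub, calling `wait_for_health("http://127.0.0.1:<port>")`, and printing `studio_url(...)` only when it returns True.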

Notes

The full tests/test_author_works.py suite was attempted but did not finish locally; it was stopped after repeated high-CPU runs. The Agent Studio-relevant AuthorWork subset listed above passed.

ColinLi98 (Owner, Author) commented

Agent Studio visual review accepted from artifacts/agent_studio_smoke_visual_review.md in the latest PR run (25705903844).

| Viewport | Check | Status | Evidence | Reviewer note |
| --- | --- | --- | --- | --- |
| desktop | Three-column workbench review | manual_review | artifacts/agent_studio_smoke_desktop.png | accepted |
| mobile | Stacked workbench review | manual_review | artifacts/agent_studio_smoke_mobile.png | accepted |

ColinLi98 marked this pull request as ready for review May 12, 2026 00:54