
feat(executor): add SWE-bench batch evaluation with hotkey auth and WebSocket streaming #2

Merged
echobt merged 8 commits into main from feat/swe-bench-batch-eval-websocket on Feb 17, 2026

echobt commented Feb 17, 2026

Summary

Rearchitects the executor from single-evaluation mode to a batch evaluation system designed for SWE-bench workloads. Introduces hotkey-based authentication, concurrent task execution with configurable parallelism, and real-time WebSocket streaming of evaluation progress.

Changes

Authentication

  • Replace Bearer token auth with SS58 hotkey verification via X-Hotkey header
  • Add bs58 crate for SS58 address validation
  • Restrict task submission to a single authorized hotkey
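
The hotkey check described above could look roughly like the following. This is a dependency-free sketch, not the code in src/auth.rs: the PR uses the bs58 crate, whereas this version hand-rolls Base58 decoding. It only checks the decoded length (assumed here to be 35 bytes: a one-byte network prefix, a 32-byte public key, and a two-byte checksum) and does not verify the checksum itself. The names `base58_decode`, `looks_like_ss58`, and `is_authorized` are hypothetical, and the authorized hotkey is passed in as a parameter rather than hardcoded.

```rust
/// Bitcoin-style Base58 alphabet, which SS58 addresses also use.
const ALPHABET: &[u8; 58] = b"123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz";

/// Decodes a Base58 string into bytes; returns None on any invalid character.
fn base58_decode(input: &str) -> Option<Vec<u8>> {
    // Accumulate the number in little-endian byte order.
    let mut bytes: Vec<u8> = Vec::new();
    for ch in input.bytes() {
        let mut carry = ALPHABET.iter().position(|&c| c == ch)? as u32;
        for b in bytes.iter_mut() {
            carry += (*b as u32) * 58;
            *b = (carry & 0xff) as u8;
            carry >>= 8;
        }
        while carry > 0 {
            bytes.push((carry & 0xff) as u8);
            carry >>= 8;
        }
    }
    // Each leading '1' encodes one leading zero byte.
    let zeros = input.bytes().take_while(|&c| c == b'1').count();
    bytes.extend(std::iter::repeat(0u8).take(zeros));
    bytes.reverse();
    Some(bytes)
}

/// Structural check only (assumption): valid Base58 that decodes to 35 bytes.
/// The SS58 checksum is NOT verified in this sketch.
fn looks_like_ss58(addr: &str) -> bool {
    matches!(base58_decode(addr), Some(raw) if raw.len() == 35)
}

/// Accepts a request only if the X-Hotkey header value is present,
/// well-formed, and equal to the single authorized hotkey.
fn is_authorized(x_hotkey: Option<&str>, authorized: &str) -> bool {
    match x_hotkey {
        Some(h) => looks_like_ss58(h) && h == authorized,
        None => false,
    }
}
```

A handler would call `is_authorized` with the value of the X-Hotkey header and reject the request when it returns false. A production check should also verify the SS58 checksum.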

Batch Execution Engine

  • Replace single POST /evaluate with POST /submit accepting multipart archives containing tasks/ and agent_code/ directories
  • Add Batch / BatchResult / BatchStatus session model replacing the single-eval Session
  • Support concurrent task execution (default 8, configurable via --concurrent-tasks) using a tokio::sync::Semaphore
  • Compute per-task binary reward (1.0 if all tests pass, 0.0 otherwise) and an aggregate batch reward (the mean of the per-task rewards)
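
To illustrate how a semaphore caps the number of tasks in flight, here is a minimal sketch. It is not the PR's implementation: the PR uses tokio's async semaphore, while this version substitutes OS threads and a small Mutex/Condvar counting semaphore so that it runs without an async runtime. `Semaphore`, `run_bounded`, and the 5 ms sleep standing in for real task work are all hypothetical.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Condvar, Mutex};
use std::thread;
use std::time::Duration;

/// Minimal counting semaphore built on Mutex + Condvar.
struct Semaphore {
    permits: Mutex<usize>,
    cv: Condvar,
}

impl Semaphore {
    fn new(permits: usize) -> Self {
        Self { permits: Mutex::new(permits), cv: Condvar::new() }
    }

    /// Blocks until a permit is available, then takes it.
    fn acquire(&self) {
        let mut p = self.permits.lock().unwrap();
        while *p == 0 {
            p = self.cv.wait(p).unwrap();
        }
        *p -= 1;
    }

    /// Returns a permit and wakes one waiter.
    fn release(&self) {
        *self.permits.lock().unwrap() += 1;
        self.cv.notify_one();
    }
}

/// Runs `task_count` dummy tasks with at most `limit` in flight at once
/// and returns the peak concurrency that was observed.
fn run_bounded(task_count: usize, limit: usize) -> usize {
    let sem = Arc::new(Semaphore::new(limit));
    let active = Arc::new(AtomicUsize::new(0));
    let peak = Arc::new(AtomicUsize::new(0));

    let handles: Vec<_> = (0..task_count)
        .map(|_| {
            let (sem, active, peak) = (sem.clone(), active.clone(), peak.clone());
            thread::spawn(move || {
                sem.acquire();
                let now = active.fetch_add(1, Ordering::SeqCst) + 1;
                peak.fetch_max(now, Ordering::SeqCst);
                thread::sleep(Duration::from_millis(5)); // stand-in for task work
                active.fetch_sub(1, Ordering::SeqCst);
                sem.release();
            })
        })
        .collect();

    for h in handles {
        h.join().unwrap();
    }
    peak.load(Ordering::SeqCst)
}
```

With the default of 8, `run_bounded(32, 8)` never reports a peak above 8, however many tasks the batch contains.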

Real-time Streaming

  • Add src/ws.rs module with WebSocket endpoint at /ws?batch_id=<id>
  • Broadcast task status transitions and test results over tokio::sync::broadcast channels
  • Support both polling (GET /batch/{id}) and WebSocket for live progress
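
The fan-out of progress events to several listeners can be sketched as follows. This is an assumption-laden illustration rather than the code in src/ws.rs: it replaces tokio's broadcast channel with one std::sync::mpsc channel per subscriber, and the `WsEvent` variant fields and the `Broadcaster` type are hypothetical (only the event names come from the commit message).

```rust
use std::sync::mpsc::{channel, Receiver, Sender};

/// Progress events; the payload fields are illustrative assumptions.
#[derive(Clone, Debug, PartialEq)]
enum WsEvent {
    TaskStarted { task_id: String },
    TaskComplete { task_id: String, passed: bool },
    BatchComplete { reward: f64 },
}

/// Sends every event to all currently connected subscribers.
struct Broadcaster {
    subscribers: Vec<Sender<WsEvent>>,
}

impl Broadcaster {
    fn new() -> Self {
        Self { subscribers: Vec::new() }
    }

    /// Registers a new listener, e.g. one WebSocket connection.
    fn subscribe(&mut self) -> Receiver<WsEvent> {
        let (tx, rx) = channel();
        self.subscribers.push(tx);
        rx
    }

    /// Delivers `event` to every live subscriber, drops the ones whose
    /// receiver has gone away, and returns how many listeners remain.
    fn send(&mut self, event: WsEvent) -> usize {
        self.subscribers.retain(|tx| tx.send(event.clone()).is_ok());
        self.subscribers.len()
    }
}
```

Each WebSocket connection would hold one receiver and forward whatever arrives on it to the client. A disconnected client is pruned on the next `send`.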

Docker

  • Change CMD to ENTRYPOINT so the Rust server is the sole entrypoint
  • Add Python 3 and build-essential to the runtime image for SWE-bench task execution

Metrics

  • Update metrics tracking from single eval counters to batch-level counters

Breaking Changes

  • POST /evaluate endpoint removed; use POST /submit with multipart archive
  • Bearer token authentication removed; use X-Hotkey header with authorized SS58 hotkey
  • Response schema changed from EvalResult to BatchResult

…cket streaming

Redesign the term-executor from a single-evaluation model to a batch-oriented
SWE-bench evaluation system with real-time streaming and hardcoded hotkey
authentication.

Core architecture changes:
- Replace bearer token auth with SS58 hotkey validation via X-Hotkey header,
  restricted to a single authorized hotkey (5GziQCcRpN...Dag2At)
- Replace single-eval model with batch processing: upload a multipart archive
  containing tasks/ and agent_code/ directories, execute all tasks with
  configurable concurrency (--concurrent-tasks, default 8, via semaphore)
- Binary reward system: 1.0 if all tests pass, 0.0 otherwise; aggregate
  reward is the mean across all tasks in a batch
- Agent code is never exposed in any API response
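
The reward rule above can be written down in a few lines. This is a minimal sketch with hypothetical function names. How the real code treats a task with no tests or an empty batch is not stated in this PR, so returning 0.0 in both cases is an assumption.

```rust
/// Binary per-task reward: 1.0 only if every test passed.
/// Assumption: a task with no test results earns 0.0.
fn task_reward(tests_passed: &[bool]) -> f64 {
    if !tests_passed.is_empty() && tests_passed.iter().all(|&p| p) {
        1.0
    } else {
        0.0
    }
}

/// Aggregate reward: the mean of the per-task rewards.
/// Assumption: an empty batch yields 0.0 rather than NaN.
fn batch_reward(task_rewards: &[f64]) -> f64 {
    if task_rewards.is_empty() {
        return 0.0;
    }
    task_rewards.iter().sum::<f64>() / task_rewards.len() as f64
}
```

For example, a batch of four tasks in which three pass all of their tests has an aggregate reward of 0.75.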

New modules and API surface:
- src/ws.rs: WebSocket handler at /ws?batch_id=<id> providing real-time
  events (snapshot on connect, then task_started, task_complete,
  batch_complete) via broadcast channels
- POST /submit: multipart archive upload replacing POST /evaluate
- GET /batch/{id}, /batch/{id}/tasks, /batch/{id}/task/{task_id}: batch and
  task status polling endpoints replacing GET /evaluate/{id}
- GET /batches: list all batches

Session and executor refactoring:
- src/session.rs: Replace Session/EvalResult/EvalStatus with Batch/BatchResult/
  BatchStatus/TaskResult/TaskStatus types; add WsEvent broadcast channel per
  batch; add has_active_batch() to enforce single-batch-at-a-time constraint
- src/executor.rs: spawn_batch() runs tasks concurrently via tokio semaphore,
  each task goes through clone→install→agent→test pipeline independently;
  emits WsEvent on task start/complete and batch complete
- src/task.rs: Add extract_uploaded_archive() to parse zip/tar.gz archives
  with tasks/ and agent_code/ structure; add SweForgeTask and ExtractedArchive
  types
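
The single-batch-at-a-time constraint mentioned for src/session.rs might be enforced along these lines. Only the names `BatchStatus` and `has_active_batch()` come from the commit message. The status variants, the `BatchStore` type, and `try_submit` are hypothetical, and the sketch assumes that "active" means pending or running.

```rust
use std::collections::HashMap;

/// Illustrative status values; the real variants may differ.
#[derive(Clone, Copy, Debug, PartialEq)]
enum BatchStatus {
    Pending,
    Running,
    Completed,
    Failed,
}

/// In-memory registry of batches keyed by batch id.
#[derive(Default)]
struct BatchStore {
    batches: HashMap<String, BatchStatus>,
}

impl BatchStore {
    /// True while any batch is still pending or running.
    fn has_active_batch(&self) -> bool {
        self.batches
            .values()
            .any(|s| matches!(s, BatchStatus::Pending | BatchStatus::Running))
    }

    /// Registers a new batch unless another one is still active.
    fn try_submit(&mut self, id: &str) -> Result<(), &'static str> {
        if self.has_active_batch() {
            return Err("a batch is already in progress");
        }
        self.batches.insert(id.to_string(), BatchStatus::Pending);
        Ok(())
    }

    /// Updates the status of a known batch; unknown ids are ignored.
    fn set_status(&mut self, id: &str, status: BatchStatus) {
        if let Some(s) = self.batches.get_mut(id) {
            *s = status;
        }
    }
}
```

A submit handler would call `try_submit` before spawning work and map the error to a conflict-style response. In a concurrent server the store would need to sit behind a lock so that the check and the insert happen atomically.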

Supporting changes:
- src/auth.rs: Simplified to verify_hotkey() + extract_hotkey() + SS58
  validation via bs58 crate
- src/config.rs: Remove auth_token, add AUTHORIZED_HOTKEY constant, rename
  max_concurrent_evals→max_concurrent_tasks, increase defaults (TTL 7200s,
  clone timeout 180s, archive limit 500MB)
- src/metrics.rs: Rename eval counters to batch/task counters
  (batches_total, batches_active, tasks_passed, tasks_failed, etc.)
- src/main.rs: Remove semaphore creation, wire up new AppState
- Cargo.toml: Bump to v0.2.0, add axum ws/multipart features, add bs58,
  futures, tokio-stream dependencies
- Dockerfile: Add python3/pip/venv/build-essential for SWE-bench task
  execution, change CMD to ENTRYPOINT
- README.md: Complete rewrite documenting new batch API, archive format,
  WebSocket protocol, reward model, and configuration
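
The new defaults listed for src/config.rs can be summarized in a small struct. This is a sketch under assumptions: only `max_concurrent_tasks` is a field name taken from the commit message, the other field names and `from_concurrent_tasks_arg` are hypothetical, and "500MB" is interpreted here as 500 × 1024 × 1024 bytes, which the source does not confirm.

```rust
use std::time::Duration;

/// Executor settings; field names other than `max_concurrent_tasks`
/// are illustrative assumptions.
#[derive(Clone, Debug, PartialEq)]
struct Config {
    max_concurrent_tasks: usize,
    ttl: Duration,
    clone_timeout: Duration,
    max_archive_bytes: u64,
}

impl Default for Config {
    fn default() -> Self {
        Self {
            max_concurrent_tasks: 8,
            ttl: Duration::from_secs(7200),
            clone_timeout: Duration::from_secs(180),
            // Assumption: "500MB" means 500 MiB.
            max_archive_bytes: 500 * 1024 * 1024,
        }
    }
}

impl Config {
    /// Applies the value given to --concurrent-tasks, if any.
    /// Missing, unparsable, or zero values fall back to the default.
    fn from_concurrent_tasks_arg(value: Option<&str>) -> Config {
        let mut cfg = Config::default();
        if let Some(n) = value
            .and_then(|v| v.parse::<usize>().ok())
            .filter(|&n| n > 0)
        {
            cfg.max_concurrent_tasks = n;
        }
        cfg
    }
}
```

For instance, `Config::from_concurrent_tasks_arg(Some("16"))` raises the concurrency limit to 16 and leaves the other defaults untouched.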

@echobt echobt merged commit 8bfa8ee into main Feb 17, 2026
4 checks passed
github-actions bot pushed a commit that referenced this pull request Feb 17, 2026
# [1.1.0](v1.0.0...v1.1.0) (2026-02-17)

### Features

* **executor:** add SWE-bench batch evaluation with hotkey auth and WebSocket streaming ([#2](#2)) ([8bfa8ee](8bfa8ee))
@github-actions

🎉 This PR is included in version 1.1.0 🎉

The release is available on GitHub release

Your semantic-release bot 📦🚀
