feat(executor): add SWE-bench batch evaluation with hotkey auth and WebSocket streaming by echobt · Pull Request #2 · PlatformNetwork/term-executor

echobt · 2026-02-17T15:48:16Z

Summary

Rearchitects the executor from single-evaluation mode to a batch evaluation system designed for SWE-bench workloads. Introduces hotkey-based authentication, concurrent task execution with configurable parallelism, and real-time WebSocket streaming of evaluation progress.

Changes

Authentication

Replace Bearer token auth with SS58 hotkey verification via X-Hotkey header
Add bs58 crate for SS58 address validation
Restrict task submission to a single authorized hotkey
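
As an illustration only, the hotkey check described above could look roughly like the sketch below. It assumes nothing beyond the standard library, whereas the PR itself relies on the bs58 crate. The function names (base58_decode, looks_like_ss58, is_authorized) are hypothetical and are not taken from src/auth.rs. The check is structural: it decodes the address and inspects its length and prefix, and it deliberately skips the Blake2b checksum that a complete SS58 validation would verify.

```rust
/// Base58 alphabet used by SS58 (no '0', 'O', 'I' or 'l').
const ALPHABET: &[u8] = b"123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz";

/// Decode a base58 string; returns None if any character is outside the alphabet.
fn base58_decode(input: &str) -> Option<Vec<u8>> {
    let mut bytes: Vec<u8> = Vec::new(); // little-endian accumulator
    for ch in input.bytes() {
        let mut carry = ALPHABET.iter().position(|&c| c == ch)? as u32;
        for b in bytes.iter_mut() {
            carry += (*b as u32) * 58;
            *b = (carry & 0xff) as u8;
            carry >>= 8;
        }
        while carry > 0 {
            bytes.push((carry & 0xff) as u8);
            carry >>= 8;
        }
    }
    // Each leading '1' encodes one leading zero byte.
    let zeros = input.bytes().take_while(|&c| c == b'1').count();
    bytes.extend(std::iter::repeat(0u8).take(zeros));
    bytes.reverse();
    Some(bytes)
}

/// Structural check only: 1 prefix byte + 32-byte public key + 2 checksum bytes.
/// Assumption: single-byte network prefix. The checksum is NOT verified here.
fn looks_like_ss58(addr: &str) -> bool {
    matches!(base58_decode(addr), Some(raw) if raw.len() == 35 && raw[0] < 64)
}

/// Hypothetical gate for the X-Hotkey header: the value must be present,
/// well-formed, and equal to the single authorized hotkey.
fn is_authorized(header: Option<&str>, authorized: &str) -> bool {
    match header {
        Some(h) => looks_like_ss58(h) && h == authorized,
        None => false,
    }
}
```

A request handler would pass the raw X-Hotkey header value (if any) and the configured hotkey to is_authorized and reject the submission when it returns false.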

Batch Execution Engine

Replace single POST /evaluate with POST /submit accepting multipart archives containing tasks/ and agent_code/ directories
Add Batch / BatchResult / BatchStatus session model replacing the single-eval Session
Support concurrent task execution (default 8, configurable via --concurrent-tasks) using a tokio::Semaphore
Compute per-task binary reward (1.0 if all tests pass, 0.0 otherwise) and aggregate batch reward
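
The reward rule above is simple enough to sketch directly. The commit message states that the aggregate is the mean across all tasks in a batch. The function names and the handling of the two edge cases (a task with no tests, an empty batch) are assumptions made for this example, not behaviour confirmed by the PR.

```rust
/// Binary per-task reward: 1.0 only when every test passed, 0.0 otherwise.
/// Assumption: a task that ran zero tests scores 0.0.
fn task_reward(passed: usize, total: usize) -> f64 {
    if total > 0 && passed == total {
        1.0
    } else {
        0.0
    }
}

/// Aggregate batch reward as the mean of the per-task rewards.
/// Assumption: an empty batch scores 0.0 instead of producing NaN.
fn batch_reward(task_rewards: &[f64]) -> f64 {
    if task_rewards.is_empty() {
        return 0.0;
    }
    task_rewards.iter().sum::<f64>() / task_rewards.len() as f64
}
```

With four tasks of which two pass all their tests, the per-task rewards are 1.0, 0.0, 1.0, 0.0 and the batch reward is 0.5.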

Real-time Streaming

Add src/ws.rs module with WebSocket endpoint at /ws?batch_id=<id>
Broadcast task status transitions and test results over tokio::sync::broadcast channels
Support both polling (GET /batch/{id}) and WebSocket for live progress
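
To make the streaming model concrete, here is a minimal fan-out sketch using only std::sync::mpsc. It is a stand-in for the tokio::sync::broadcast channels named above, not the code in src/ws.rs. The event names follow the commit message (snapshot on connect, then task_started, task_complete, batch_complete), while the payload fields and the Hub type are assumptions. Unlike a tokio broadcast channel, this version gives every subscriber an unbounded queue, so slow readers never miss events.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};

/// Event kinds named in the commit message; the fields are hypothetical.
#[derive(Clone, Debug, PartialEq)]
enum WsEvent {
    Snapshot { completed: usize, total: usize },
    TaskStarted { task_id: String },
    TaskComplete { task_id: String, reward: f64 },
    BatchComplete { reward: f64 },
}

/// Tiny fan-out hub for one batch: each subscriber has its own queue.
struct Hub {
    subscribers: Vec<Sender<WsEvent>>,
    completed: usize,
    total: usize,
}

impl Hub {
    fn new(total: usize) -> Self {
        Hub { subscribers: Vec::new(), completed: 0, total }
    }

    /// A new subscriber first receives a snapshot of current progress.
    fn subscribe(&mut self) -> Receiver<WsEvent> {
        let (tx, rx) = channel();
        let _ = tx.send(WsEvent::Snapshot { completed: self.completed, total: self.total });
        self.subscribers.push(tx);
        rx
    }

    /// Deliver an event to all live subscribers and track completed tasks.
    fn publish(&mut self, event: WsEvent) {
        if matches!(event, WsEvent::TaskComplete { .. }) {
            self.completed += 1;
        }
        // Drop subscribers whose receiving side has gone away.
        self.subscribers.retain(|tx| tx.send(event.clone()).is_ok());
    }
}
```

A client that connects midway through a batch therefore sees one snapshot followed by only the events published after it subscribed, which is the same pattern a polling client approximates with repeated GET /batch/{id} calls.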

Docker

Change CMD to ENTRYPOINT so the Rust server is the sole entrypoint
Add Python 3 and build-essential to the runtime image for SWE-bench task execution

Metrics

Update metrics tracking from single eval counters to batch-level counters

Breaking Changes

POST /evaluate endpoint removed; use POST /submit with multipart archive
Bearer token authentication removed; use X-Hotkey header with authorized SS58 hotkey
Response schema changed from EvalResult to BatchResult

…cket streaming

Redesign the term-executor from a single-evaluation model to a batch-oriented SWE-bench evaluation system with real-time streaming and hardcoded hotkey authentication.

Core architecture changes:
- Replace bearer token auth with SS58 hotkey validation via X-Hotkey header, restricted to a single authorized hotkey (5GziQCcRpN...Dag2At)
- Replace single-eval model with batch processing: upload a multipart archive containing tasks/ and agent_code/ directories, execute all tasks with configurable concurrency (--concurrent-tasks, default 8, via semaphore)
- Binary reward system: 1.0 if all tests pass, 0.0 otherwise; aggregate reward is the mean across all tasks in a batch
- Agent code is never exposed in any API response

New modules and API surface:
- src/ws.rs: WebSocket handler at /ws?batch_id=<id> providing real-time events (snapshot on connect, then task_started, task_complete, batch_complete) via broadcast channels
- POST /submit: multipart archive upload replacing POST /evaluate
- GET /batch/{id}, /batch/{id}/tasks, /batch/{id}/task/{task_id}: batch and task status polling endpoints replacing GET /evaluate/{id}
- GET /batches: list all batches

Session and executor refactoring:
- src/session.rs: Replace Session/EvalResult/EvalStatus with Batch/BatchResult/BatchStatus/TaskResult/TaskStatus types; add WsEvent broadcast channel per batch; add has_active_batch() to enforce single-batch-at-a-time constraint
- src/executor.rs: spawn_batch() runs tasks concurrently via tokio semaphore, each task goes through clone→install→agent→test pipeline independently; emits WsEvent on task start/complete and batch complete
- src/task.rs: Add extract_uploaded_archive() to parse zip/tar.gz archives with tasks/ and agent_code/ structure; add SweForgeTask and ExtractedArchive types

Supporting changes:
- src/auth.rs: Simplified to verify_hotkey() + extract_hotkey() + SS58 validation via bs58 crate
- src/config.rs: Remove auth_token, add AUTHORIZED_HOTKEY constant, rename max_concurrent_evals→max_concurrent_tasks, increase defaults (TTL 7200s, clone timeout 180s, archive limit 500MB)
- src/metrics.rs: Rename eval counters to batch/task counters (batches_total, batches_active, tasks_passed, tasks_failed, etc.)
- src/main.rs: Remove semaphore creation, wire up new AppState
- Cargo.toml: Bump to v0.2.0, add axum ws/multipart features, add bs58, futures, tokio-stream dependencies
- Dockerfile: Add python3/pip/venv/build-essential for SWE-bench task execution, change CMD to ENTRYPOINT
- README.md: Complete rewrite documenting new batch API, archive format, WebSocket protocol, reward model, and configuration
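
The bounded concurrency that the commit message attributes to spawn_batch() can be approximated without an async runtime. The sketch below uses OS threads and a hand-written counting semaphore in place of the tokio semaphore. Semaphore, run_batch, and the run_task callback are hypothetical names for this example, and the callback stands in for the whole clone, install, agent, test pipeline. It assumes limit is at least 1 and that run_task does not panic (a panic would leak a permit).

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

/// Minimal counting semaphore; a std-only stand-in for tokio's semaphore.
struct Semaphore {
    permits: Mutex<usize>,
    available: Condvar,
}

impl Semaphore {
    fn new(permits: usize) -> Self {
        Semaphore { permits: Mutex::new(permits), available: Condvar::new() }
    }

    /// Block until a permit is free, then take it.
    fn acquire(&self) {
        let mut p = self.permits.lock().unwrap();
        while *p == 0 {
            p = self.available.wait(p).unwrap();
        }
        *p -= 1;
    }

    /// Return a permit and wake one waiter.
    fn release(&self) {
        *self.permits.lock().unwrap() += 1;
        self.available.notify_one();
    }
}

/// Run `run_task` for every task id with at most `limit` running at once.
/// Returns the pass/fail results in input order and the observed peak
/// number of simultaneously running tasks.
fn run_batch<F>(task_ids: Vec<String>, limit: usize, run_task: F) -> (Vec<bool>, usize)
where
    F: Fn(&str) -> bool + Send + Sync + 'static,
{
    let sem = Arc::new(Semaphore::new(limit));
    let run_task = Arc::new(run_task);
    let running = Arc::new(Mutex::new((0usize, 0usize))); // (current, peak)
    let handles: Vec<_> = task_ids
        .into_iter()
        .map(|id| {
            let (sem, run_task, running) = (sem.clone(), run_task.clone(), running.clone());
            thread::spawn(move || {
                sem.acquire();
                {
                    let mut r = running.lock().unwrap();
                    r.0 += 1;
                    r.1 = r.1.max(r.0);
                }
                let passed = (*run_task)(&id);
                running.lock().unwrap().0 -= 1;
                sem.release();
                passed
            })
        })
        .collect();
    let results: Vec<bool> = handles.into_iter().map(|h| h.join().unwrap()).collect();
    let peak = running.lock().unwrap().1;
    (results, peak)
}
```

Calling run_batch with limit set to 8 mirrors the default of --concurrent-tasks: all tasks are queued immediately, but no more than eight execute at any moment.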

coderabbitai · 2026-02-17T15:48:34Z

Warning

Rate limit exceeded

@echobt has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 19 minutes and 21 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.



# [1.1.0](v1.0.0...v1.1.0) (2026-02-17)

### Features

* **executor:** add SWE-bench batch evaluation with hotkey auth and WebSocket streaming ([#2](#2)) ([8bfa8ee](8bfa8ee))

github-actions · 2026-02-17T16:01:24Z

🎉 This PR is included in version 1.1.0 🎉

The release is available on GitHub release

Your semantic-release bot 📦🚀

echobt added 2 commits February 17, 2026 15:47

ci: trigger CI run

aa62fc2

echobt added 6 commits February 17, 2026 15:49

style: fix CI — apply rustfmt formatting

9a40e4c

fix: resolve clippy warnings — identical if/else, collapsible if, use…

de400e6

…less into()

style: fix remaining rustfmt formatting issues

5753df3

Merge remote-tracking branch 'origin/main' into feat/swe-bench-batch-…

501a6b8

…eval-websocket

ci: trigger CI after merge with main

ba5c8ac

Merge remote-tracking branch 'origin/main' into feat/swe-bench-batch-…

0f3aa23

…eval-websocket # Conflicts: # Cargo.toml

echobt merged commit 8bfa8ee into main Feb 17, 2026
4 checks passed

github-actions bot added the released label Feb 17, 2026

echobt merged 8 commits into main from feat/swe-bench-batch-eval-websocket
