…cket streaming
Redesign the term-executor from a single-evaluation model to a batch-oriented
SWE-bench evaluation system with real-time streaming and hardcoded hotkey
authentication.
Core architecture changes:
- Replace bearer token auth with SS58 hotkey validation via X-Hotkey header,
restricted to a single authorized hotkey (5GziQCcRpN...Dag2At)
- Replace single-eval model with batch processing: upload a multipart archive
containing tasks/ and agent_code/ directories, execute all tasks with
configurable concurrency (--concurrent-tasks, default 8, via semaphore)
- Binary reward system: 1.0 if all tests pass, 0.0 otherwise; aggregate
reward is the mean across all tasks in a batch
- Agent code is never exposed in any API response
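The reward rule above can be sketched in a few lines. This is a minimal illustration, not the code in this PR: the function names `task_reward` and `aggregate_reward` are hypothetical, and the handling of a task with zero tests and of an empty batch (both scored 0.0 here) is an assumption the commit message does not specify.

```rust
/// Binary reward for one task: 1.0 only when every test passed.
/// Assumption: a task that ran zero tests is scored as a failure.
fn task_reward(tests_passed: usize, tests_total: usize) -> f64 {
    if tests_total > 0 && tests_passed == tests_total {
        1.0
    } else {
        0.0
    }
}

/// Aggregate reward for a batch: the mean of the per-task rewards.
/// Assumption: an empty batch yields 0.0 instead of dividing by zero.
fn aggregate_reward(task_rewards: &[f64]) -> f64 {
    if task_rewards.is_empty() {
        return 0.0;
    }
    task_rewards.iter().sum::<f64>() / task_rewards.len() as f64
}
```

For example, a batch of four tasks where three pass every test would score `aggregate_reward(&[1.0, 1.0, 1.0, 0.0])`, i.e. 0.75.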
New modules and API surface:
- src/ws.rs: WebSocket handler at /ws?batch_id=<id> providing real-time
events (snapshot on connect, then task_started, task_complete,
batch_complete) via broadcast channels
- POST /submit: multipart archive upload replacing POST /evaluate
- GET /batch/{id}, /batch/{id}/tasks, /batch/{id}/task/{task_id}: batch and
task status polling endpoints replacing GET /evaluate/{id}
- GET /batches: list all batches
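To make the event sequence on /ws concrete, here is a rough model of the four event kinds and of how a client might fold them into a progress count. Only the event names (snapshot, task_started, task_complete, batch_complete) come from this PR; every field, the `name` and `drain_until_complete` helpers, and the idea that the snapshot carries a completed-task count are assumptions made for illustration.

```rust
/// Hypothetical shape of the events streamed over /ws?batch_id=<id>.
/// Field names and types are assumptions, not the PR's wire format.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum WsEvent {
    Snapshot { batch_id: String, completed_tasks: usize, total_tasks: usize },
    TaskStarted { task_id: String },
    TaskComplete { task_id: String, reward: f64 },
    BatchComplete { aggregate_reward: f64 },
}

impl WsEvent {
    /// Event name as listed in the commit message.
    fn name(&self) -> &'static str {
        match self {
            WsEvent::Snapshot { .. } => "snapshot",
            WsEvent::TaskStarted { .. } => "task_started",
            WsEvent::TaskComplete { .. } => "task_complete",
            WsEvent::BatchComplete { .. } => "batch_complete",
        }
    }
}

/// Client-side fold: start from the snapshot, count completions, and stop
/// at batch_complete. Returns (completed tasks, aggregate reward if seen).
fn drain_until_complete(events: &[WsEvent]) -> (usize, Option<f64>) {
    let mut completed = 0;
    for event in events {
        match event {
            WsEvent::Snapshot { completed_tasks, .. } => completed = *completed_tasks,
            WsEvent::TaskStarted { .. } => {}
            WsEvent::TaskComplete { .. } => completed += 1,
            WsEvent::BatchComplete { aggregate_reward } => {
                return (completed, Some(*aggregate_reward));
            }
        }
    }
    (completed, None)
}
```

Sending a snapshot first means a client that connects mid-batch can still reconstruct the current state without replaying earlier events.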
Session and executor refactoring:
- src/session.rs: Replace Session/EvalResult/EvalStatus with Batch/BatchResult/
BatchStatus/TaskResult/TaskStatus types; add WsEvent broadcast channel per
batch; add has_active_batch() to enforce single-batch-at-a-time constraint
- src/executor.rs: spawn_batch() runs tasks concurrently via tokio semaphore,
each task goes through clone→install→agent→test pipeline independently;
emits WsEvent on task start/complete and batch complete
- src/task.rs: Add extract_uploaded_archive() to parse zip/tar.gz archives
with tasks/ and agent_code/ structure; add SweForgeTask and ExtractedArchive
types
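The bounded-concurrency idea behind spawn_batch() can be illustrated without any external crates. Note the substitution: the PR uses a tokio semaphore, while this sketch uses a fixed pool of std worker threads pulling from a shared queue, which gives the same "at most N tasks in flight" guarantee. The `run_batch` function, its signature, and the `run_task` callback (standing in for the clone→install→agent→test pipeline) are hypothetical.

```rust
use std::collections::VecDeque;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;
use std::thread;

/// Runs `run_task` for every task id with at most `limit` tasks in flight.
/// Returns the (task id, passed) pairs in completion order, plus the peak
/// number of tasks that were observed running at the same time.
fn run_batch<F>(task_ids: Vec<String>, limit: usize, run_task: F) -> (Vec<(String, bool)>, usize)
where
    F: Fn(&str) -> bool + Sync,
{
    let queue = Mutex::new(VecDeque::from(task_ids));
    let results = Mutex::new(Vec::new());
    let in_flight = AtomicUsize::new(0);
    let peak = AtomicUsize::new(0);

    // Scoped threads may borrow the locals above; all workers are joined
    // before `thread::scope` returns.
    thread::scope(|scope| {
        for _ in 0..limit.max(1) {
            scope.spawn(|| loop {
                // Take the next task; the queue lock is released before it runs.
                let next = queue.lock().unwrap().pop_front();
                let Some(task_id) = next else { break };

                let now = in_flight.fetch_add(1, Ordering::SeqCst) + 1;
                peak.fetch_max(now, Ordering::SeqCst);
                let passed = run_task(&task_id);
                in_flight.fetch_sub(1, Ordering::SeqCst);

                results.lock().unwrap().push((task_id, passed));
            });
        }
    });

    (results.into_inner().unwrap(), peak.into_inner())
}
```

With a semaphore, every task gets its own spawned future and waits for a permit; with a worker pool, the number of workers is the limit. Either way one slow task does not block the others, which is the property the batch executor relies on.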
Supporting changes:
- src/auth.rs: Simplified to verify_hotkey() + extract_hotkey() + SS58
validation via bs58 crate
- src/config.rs: Remove auth_token, add AUTHORIZED_HOTKEY constant, rename
max_concurrent_evals→max_concurrent_tasks, increase defaults (TTL 7200s,
clone timeout 180s, archive limit 500MB)
- src/metrics.rs: Rename eval counters to batch/task counters
(batches_total, batches_active, tasks_passed, tasks_failed, etc.)
- src/main.rs: Remove semaphore creation, wire up new AppState
- Cargo.toml: Bump to v0.2.0, add axum ws/multipart features, add bs58,
futures, tokio-stream dependencies
- Dockerfile: Add python3/pip/venv/build-essential for SWE-bench task
execution, change CMD to ENTRYPOINT
- README.md: Complete rewrite documenting new batch API, archive format,
WebSocket protocol, reward model, and configuration
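A simplified sketch of the X-Hotkey check follows. Several things here are assumptions: the constant is a placeholder (a well-known Substrate development address), because the real authorized hotkey is elided in this PR; the signature of `verify_hotkey` and the `AuthError` variants are hypothetical; and the format check only looks at the base58 alphabet and a plausible length, whereas the PR validates SS58 via the bs58 crate, which this std-only example does not reproduce.

```rust
/// Placeholder only. The real AUTHORIZED_HOTKEY is elided in the PR text;
/// this is a well-known Substrate development address used for illustration.
const AUTHORIZED_HOTKEY: &str = "5GrwvaEF5zXb26Fz9rcQpDWS57CtERHpNehXCPcNoHGKutQY";

/// Base58 alphabet (no 0, O, I, or l).
const BASE58_ALPHABET: &str = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz";

#[derive(Debug, PartialEq)]
enum AuthError {
    MissingHeader,
    InvalidFormat,
    Unauthorized,
}

/// Shape check only: base58 characters and an assumed 47..=48 length.
/// It does not decode the address or verify the SS58 checksum.
fn looks_like_ss58(address: &str) -> bool {
    (47..=48).contains(&address.len())
        && address.chars().all(|c| BASE58_ALPHABET.contains(c))
}

/// Hypothetical signature: takes the raw X-Hotkey header value, if any.
fn verify_hotkey(header_value: Option<&str>) -> Result<&str, AuthError> {
    let hotkey = header_value
        .map(str::trim)
        .filter(|value| !value.is_empty())
        .ok_or(AuthError::MissingHeader)?;
    if !looks_like_ss58(hotkey) {
        return Err(AuthError::InvalidFormat);
    }
    if hotkey != AUTHORIZED_HOTKEY {
        return Err(AuthError::Unauthorized);
    }
    Ok(hotkey)
}
```

Separating the format error from the authorization error lets the server tell a malformed header apart from a well-formed but unauthorized hotkey.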
…eval-websocket # Conflicts: # Cargo.toml
# [1.1.0](v1.0.0...v1.1.0) (2026-02-17)
### Features
* **executor:** add SWE-bench batch evaluation with hotkey auth and WebSocket streaming ([#2](#2)) ([8bfa8ee](8bfa8ee))
🎉 This PR is included in version 1.1.0 🎉
The release is available on GitHub release
Your semantic-release bot 📦🚀
Summary
Rearchitects the executor from single-evaluation mode to a batch evaluation system designed for SWE-bench workloads. Introduces hotkey-based authentication, concurrent task execution with configurable parallelism, and real-time WebSocket streaming of evaluation progress.
Changes
Authentication
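- Replaces bearer token auth with SS58 hotkey validation via the `X-Hotkey` header
- Adds the `bs58` crate for SS58 address validation
Batch Execution Engine
- Replaces `POST /evaluate` with `POST /submit` accepting multipart archives containing `tasks/` and `agent_code/` directories
- Adds a `Batch`/`BatchResult`/`BatchStatus` session model replacing the single-eval `Session`
- Runs tasks with configurable concurrency (`--concurrent-tasks`) using a `tokio::sync::Semaphore`
Real-time Streaming
- Adds a `src/ws.rs` module with a WebSocket endpoint at `/ws?batch_id=<id>`
- Delivers events over `tokio::sync::broadcast` channels
- Supports both polling (`GET /batch/{id}`) and WebSocket for live progress
Docker
- Changes `CMD` to `ENTRYPOINT` so the Rust server is the sole entrypoint
Metrics
- Renames eval counters to batch/task counters
Breaking Changes
- `POST /evaluate` endpoint removed; use `POST /submit` with multipart archive
- Authentication now requires the `X-Hotkey` header with the authorized SS58 hotkey
- Result type renamed from `EvalResult` to `BatchResult`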