Implement batch judge calls #350

@ScuttleBot

Description

Priority 2: Batch Judge Calls (20-30% speedup)

Problem

Currently we make 98 separate API calls to the judge (one per task). Each call has network latency overhead.

Solution

Batch judge prompts for multiple tasks into a single API call.

Implementation

Add a batched variant of _grade_llm_judge() that accepts a list of tasks and their execution results and returns a list of grade results.

def _batch_grade_llm_judge(
    tasks: List[Task],
    execution_results: List[Dict],
    batch_size: int = 5,
    ...
) -> List[GradeResult]:
    # Build combined prompt
    prompt = "Grade the following tasks. Respond with a JSON array:\n\n"
    for i, (task, result) in enumerate(zip(tasks, execution_results)):
        prompt += f"Task {i+1}: {task.task_id}\n{_summarize_transcript(result)}\n\n"
    prompt += "Response format: [{task_id, scores, total, notes}, ...]"
    
    # Single API call
    response = call_judge_api(prompt, ...)
    return _parse_batch_response(response)

Expected Impact

  • API round-trips: 98 → ~20 (batch of 5)
  • Network latency savings: ~30-60 seconds per model

Testing Plan

  1. Run a sample of 10 tasks through both single and batched grading
  2. Compare scores for consistency
  3. Validate JSON parsing handles batch responses
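Step 2 of the plan could be automated with a tolerance check like the following (the function name and default tolerance are assumptions, not project conventions):

```python
from typing import Dict


def scores_consistent(
    single: Dict[str, float],
    batched: Dict[str, float],
    tolerance: float = 0.5,
) -> bool:
    """True if every task graded in both modes agrees within tolerance."""
    # Require the same set of task IDs; a missing task is a failure, not a skip.
    if single.keys() != batched.keys():
        return False
    return all(abs(single[t] - batched[t]) <= tolerance for t in single)
```

A strict key-set comparison catches tasks the batched parser dropped, which a score-only diff would miss.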
