Priority 2: Batch Judge Calls (20-30% speedup)
Problem
Currently we make 98 separate API calls to the judge (one per task), and each call pays its own network round-trip latency.
Solution
Batch judge prompts for multiple tasks into a single API call.
Implementation
Add a batched counterpart to _grade_llm_judge() that accepts a list of tasks and returns a list of results:
def _batch_grade_llm_judge(
    tasks: List[Task],
    execution_results: List[Dict],
    batch_size: int = 5,
    ...
) -> List[GradeResult]:
    results: List[GradeResult] = []
    # One combined prompt (and one API call) per batch_size chunk of tasks
    for start in range(0, len(tasks), batch_size):
        chunk = list(zip(tasks, execution_results))[start:start + batch_size]
        prompt = "Grade the following tasks. Respond with a JSON array:\n\n"
        for i, (task, result) in enumerate(chunk):
            prompt += f"Task {i+1}: {task.task_id}\n{_summarize_transcript(result)}\n\n"
        prompt += "Response format: [{task_id, scores, total, notes}, ...]"
        # Single API call grades the whole chunk
        response = call_judge_api(prompt, ...)
        results.extend(_parse_batch_response(response))
    return results
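For reference, a minimal sketch of _parse_batch_response(), assuming the judge returns a JSON array of objects matching the format above; the GradeResult field names (task_id, scores, total, notes) are inferred from the prompt's response-format line, not confirmed:

import json

def _parse_batch_response(response: str) -> List[GradeResult]:
    # Judges often wrap JSON in markdown fences; strip them before parsing
    text = response.strip().removeprefix("```json").removesuffix("```").strip()
    entries = json.loads(text)
    # Field names assumed from the response-format instruction in the prompt
    return [
        GradeResult(
            task_id=entry["task_id"],
            scores=entry["scores"],
            total=entry["total"],
            notes=entry.get("notes", ""),
        )
        for entry in entries
    ]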
Expected Impact
- API round-trips: 98 → ~20 (batch of 5)
- Network latency savings: ~30-60 seconds per model (~78 fewer round-trips at roughly 0.4-0.8 s each)
Testing Plan
- Run a sample of 10 tasks both single and batched
- Compare scores for consistency (a test sketch follows this list)
- Validate that JSON parsing handles batch responses
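A minimal consistency check, assuming the existing single-task call takes the form _grade_llm_judge(task, result) (signature assumed) alongside the batched function above:

def test_batch_vs_single_consistency(sample_tasks, sample_results):
    # Grade the same 10-task sample both ways
    single = [_grade_llm_judge(t, r) for t, r in zip(sample_tasks, sample_results)]
    batched = _batch_grade_llm_judge(sample_tasks, sample_results, batch_size=5)
    assert len(single) == len(batched)
    for s, b in zip(single, batched):
        assert s.task_id == b.task_id
        # Judge scoring is stochastic; compare within a tolerance, not exactly
        assert abs(s.total - b.total) <= 1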