Priority 2: Batch Judge Calls (20-30% speedup)
Problem
Currently we make 98 separate API calls to the judge (one per task), and each call pays its own network round-trip latency.
Solution
Batch judge prompts for multiple tasks into a single API call.
Implementation
Add a batched counterpart to _grade_llm_judge() that accepts a list of tasks and returns a list of results:
def _batch_grade_llm_judge(
    tasks: List[Task],
    execution_results: List[Dict],
    batch_size: int = 5,
    ...
) -> List[GradeResult]:
    results: List[GradeResult] = []
    # One combined prompt (and one API call) per batch_size chunk of tasks
    for start in range(0, len(tasks), batch_size):
        chunk = list(zip(tasks, execution_results))[start:start + batch_size]
        prompt = "Grade the following tasks. Respond with a JSON array:\n\n"
        for i, (task, result) in enumerate(chunk):
            prompt += f"Task {i+1}: {task.task_id}\n{_summarize_transcript(result)}\n\n"
        prompt += "Response format: [{task_id, scores, total, notes}, ...]"
        # Single API call grades the whole chunk
        response = call_judge_api(prompt, ...)
        results.extend(_parse_batch_response(response))
    return results
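For reference, a minimal sketch of _parse_batch_response(), assuming the judge returns a JSON array of objects matching the format above; the GradeResult field names (task_id, scores, total, notes) are inferred from the prompt's response-format line, not confirmed:

import json

def _parse_batch_response(response: str) -> List[GradeResult]:
    # Judges often wrap JSON in markdown fences; strip them before parsing
    text = response.strip().removeprefix("```json").removesuffix("```").strip()
    entries = json.loads(text)
    # Field names assumed from the response-format instruction in the prompt
    return [
        GradeResult(
            task_id=entry["task_id"],
            scores=entry["scores"],
            total=entry["total"],
            notes=entry.get("notes", ""),
        )
        for entry in entries
    ]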
Expected Impact
- API round-trips: 98 → ~20 (batch of 5)
- Network latency savings: ~30-60 seconds per model (~78 fewer round-trips at roughly 0.4-0.8 s each)
Testing Plan
- Run a sample of 10 tasks both single and batched
- Compare scores for consistency (a test sketch follows this list)
- Validate that JSON parsing handles batch responses
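A minimal consistency check, assuming the existing single-task call takes the form _grade_llm_judge(task, result) (signature assumed) alongside the batched function above:

def test_batch_vs_single_consistency(sample_tasks, sample_results):
    # Grade the same 10-task sample both ways
    single = [_grade_llm_judge(t, r) for t, r in zip(sample_tasks, sample_results)]
    batched = _batch_grade_llm_judge(sample_tasks, sample_results, batch_size=5)
    assert len(single) == len(batched)
    for s, b in zip(single, batched):
        assert s.task_id == b.task_id
        # Judge scoring is stochastic; compare within a tolerance, not exactly
        assert abs(s.total - b.total) <= 1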