
log eval time#392

Merged
mayinghan merged 1 commit into main from log-eval-time
Jan 5, 2026

Conversation

@mayinghan
Collaborator

@mayinghan mayinghan commented Jan 5, 2026

Note

Adds precise evaluation timing to aid performance analysis.

  • In pointwise mode, wrap each eval with a timer and set result.execution_metadata.eval_duration_seconds
  • In groupwise mode, time the grouped eval once and assign the same duration to each row in results
  • In all-mode, time the dataset-wide eval and propagate eval_duration_seconds to all returned rows

Written by Cursor Bugbot for commit 6bfe52e.
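The pointwise wrapping described above can be sketched roughly as follows. This is a minimal illustration, not the repo's actual code: `ExecutionMetadata` and `EvaluationRow` here are simplified stand-ins for the project's real types, and the eval body is a placeholder.

```python
import time
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ExecutionMetadata:
    eval_duration_seconds: Optional[float] = None

@dataclass
class EvaluationRow:
    score: float = 0.0
    execution_metadata: ExecutionMetadata = field(default_factory=ExecutionMetadata)

def run_pointwise_eval(row: EvaluationRow) -> EvaluationRow:
    # Wrap the eval in a perf_counter() pair and record the elapsed
    # wall-clock seconds on the row's execution metadata.
    start = time.perf_counter()
    row.score = 1.0  # stand-in for the real evaluation logic
    row.execution_metadata.eval_duration_seconds = time.perf_counter() - start
    return row
```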

@xzrderek xzrderek marked this pull request as ready for review January 5, 2026 22:59
Contributor

@xzrderek xzrderek left a comment


Looks good. It will also measure the time spent waiting for the semaphore; is that intended?

async def _execute_pointwise_eval_with_semaphore(
    row: EvaluationRow,
) -> EvaluationRow:
    start_time = time.perf_counter()


Eval duration includes semaphore wait time, not just execution

The start_time is captured before async with semaphore:, which means eval_duration_seconds includes time spent waiting for the semaphore, not just the actual evaluation execution time. When concurrent evaluations compete for semaphore slots, reported durations will be artificially inflated. The start_time assignment needs to be moved inside the async with semaphore: block to measure only the actual evaluation time.
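The suggested fix, starting the clock only after the semaphore slot is acquired, can be sketched like this. It is a self-contained illustration of the pattern, not the repo's code; `timed_eval` and the sleep-based workload are hypothetical.

```python
import asyncio
import time

async def timed_eval(semaphore: asyncio.Semaphore, work_seconds: float) -> float:
    async with semaphore:
        # Start the clock only after a slot is acquired, so the reported
        # duration reflects evaluation time, not semaphore wait time.
        start = time.perf_counter()
        await asyncio.sleep(work_seconds)  # stand-in for the real eval
        return time.perf_counter() - start

async def main() -> list:
    # Four evals competing for two slots: the last two wait ~0.1 s for a
    # slot, but their reported durations still come out near 0.1 s.
    semaphore = asyncio.Semaphore(2)
    return await asyncio.gather(*(timed_eval(semaphore, 0.1) for _ in range(4)))

durations = asyncio.run(main())
```

If `start` were captured before `async with semaphore:`, the last two tasks would report roughly double the true evaluation time.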

Additional Locations (1)


if isinstance(results, list):
    eval_duration = time.perf_counter() - start_time
    for r in results:
        r.execution_metadata.eval_duration_seconds = eval_duration


Duration set before element type validation causes AttributeError

The new loop at lines 636-637 accesses r.execution_metadata.eval_duration_seconds before the validation at lines 638-642 confirms all elements are EvaluationRow instances. If the test function returns a list containing non-EvaluationRow objects, this will raise an AttributeError instead of the helpful ValueError with guidance. This is inconsistent with the pointwise case where validation occurs before setting the duration.
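Validating before assigning would avoid the `AttributeError`. A minimal sketch of that ordering, with simplified stand-in types and a hypothetical `attach_duration` helper (the repo's actual error message and structure will differ):

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ExecutionMetadata:
    eval_duration_seconds: Optional[float] = None

@dataclass
class EvaluationRow:
    execution_metadata: ExecutionMetadata = field(default_factory=ExecutionMetadata)

def attach_duration(results: object, eval_duration: float) -> List[EvaluationRow]:
    # Validate element types first, so a bad return value surfaces the
    # helpful ValueError instead of an AttributeError mid-loop.
    if not isinstance(results, list) or not all(
        isinstance(r, EvaluationRow) for r in results
    ):
        raise ValueError("test function must return a list of EvaluationRow")
    for r in results:
        r.execution_metadata.eval_duration_seconds = eval_duration
    return results
```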


@mayinghan
Collaborator Author

@xzrderek yes

@mayinghan mayinghan merged commit 3184377 into main Jan 5, 2026
17 checks passed
@mayinghan mayinghan deleted the log-eval-time branch January 5, 2026 23:08