Benchmark Evaluation #66
is_mcq = bool(re.search(r"(?im)^\s*options\s*:\s*$", str(task.get("input", ""))))
if is_mcq:
    answer_instruction = (
Should we add something like the following lines to the prompt?
- "You are an expert multiple-choice question solver."
- "The answer choice must be one of: A, B, C, D, or E."
No, that doesn't add much to the prompt.
afkanpour
left a comment
@afkanpour reviewed 1 file and all commit messages, and made 8 comments.
Reviewable status: 1 of 33 files reviewed, 8 unresolved discussions (waiting on Negiiiin and saidul-islam98).
scripts/flatten_inspect_logs.py line 90 at r2 (raw file):
accuracy = (num_correct / num_samples) if num_samples else 0.0
# In this binary setting with grades only, we treat F1 as equal to accuracy.
f1 = accuracy
This is very confusing. If we don't calculate F1 here, why set it equal to accuracy? This could cause problems if someone who's unaware of this assignment uses the output json.
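One way to address this concern is to leave the F1 field explicitly empty until it is actually computed. The following is a minimal sketch, not the code in this PR: the function name `summarize_grades` and the output keys are hypothetical, and the grade `"C"` for a correct answer is an assumption (only `"I"` for incorrect appears in the quoted code).

```python
from typing import Any, Dict, List


def summarize_grades(rows: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Summarize binary grades without aliasing F1 to accuracy."""
    num_samples = len(rows)
    # Assumption: "C" marks a correct answer and "I" an incorrect one.
    num_correct = sum(1 for row in rows if row.get("grade") == "C")
    num_incorrect = sum(1 for row in rows if row.get("grade") == "I")
    accuracy = (num_correct / num_samples) if num_samples else 0.0
    return {
        "num_samples": num_samples,
        "num_correct": num_correct,
        "num_incorrect": num_incorrect,
        "accuracy": accuracy,
        # F1 is not computed here, so report it as missing rather than
        # copying accuracy into the field.
        "f1": None,
    }
```

With `None` in the output, a downstream consumer of the JSON sees `null` and cannot mistake the value for a real F1 score.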
Code quote:
f1 = accuracy
src/eval_stages/stage1_eval_execution.py line 42 at r2 (raw file):
def _inspect_judge_model(judge_llm: Dict[str, Any]) -> Union[str, InspectModel]:
    """Resolve judge for Inspect so it can use a different API base than the subject."""
    provider = str(judge_llm.get("provider", "openai"))
What will happen here if we can't use an OpenAI model? For example, if we don't have quota?
What if we want to use a non-OpenAI judge model? Can we specify this in a config?
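A config-driven judge could require the provider to be stated instead of silently defaulting to `"openai"`. The sketch below is only an illustration of that idea: the function name `resolve_judge_model_name` and the `"name"` key are hypothetical, and it assumes the judge is identified by a `provider/model` string.

```python
from typing import Any, Dict


def resolve_judge_model_name(judge_llm: Dict[str, Any]) -> str:
    """Build a 'provider/model' identifier from the judge config.

    Raises ValueError instead of falling back to a default provider, so a
    missing or misspelled config entry is caught before any API call.
    """
    # Assumption: the config carries "provider" and a hypothetical "name" key.
    provider = str(judge_llm.get("provider", "")).strip()
    name = str(judge_llm.get("name", "")).strip()
    if not provider or not name:
        raise ValueError(
            "judge_llm must set both 'provider' and 'name'; "
            f"got provider={provider!r}, name={name!r}"
        )
    return f"{provider}/{name}"
```

For example, `resolve_judge_model_name({"provider": "anthropic", "name": "some-judge"})` returns `"anthropic/some-judge"`, while an empty config fails immediately with a readable error.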
Code quote:
provider = str(judge_llm.get("provider", "openai"))
src/eval_stages/stage1_local_eval_execution.py line 81 at r2 (raw file):
num_incorrect = sum(1 for row in rows if row.get("grade") == "I")
accuracy = (num_correct / num_samples) if num_samples else 0.0
f1 = accuracy
This is very confusing. If we don't calculate F1 here, why set it equal to accuracy? This could cause problems if someone who's unaware of this assignment uses the output json.
Code quote:
f1 = accuracy
src/eval_stages/stage1_local_eval_execution.py line 105 at r2 (raw file):
prompt = str(task["input"])
is_mcq = bool(re.search(r"(?im)^\s*options\s*:\s*$", str(task.get("input", ""))))
This is not a robust way to detect an MCQ. Can we add a problem_type field to the problem JSON that marks a problem as multiple-choice?
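A sketch of what the explicit field could look like, assuming the value `"mcq"` for multiple-choice problems (the value and the helper name `is_multiple_choice` are hypothetical). It falls back to the existing regex heuristic only when the field is absent, so older problem files keep working.

```python
import re
from typing import Any, Dict

# Same heuristic as the quoted code: a line consisting only of "Options:".
_OPTIONS_HEADER = re.compile(r"(?im)^\s*options\s*:\s*$")


def is_multiple_choice(task: Dict[str, Any]) -> bool:
    """Prefer an explicit problem_type field; fall back to the prompt heuristic."""
    problem_type = task.get("problem_type")
    if problem_type is not None:
        # Assumption: "mcq" is the value used to mark multiple-choice problems.
        return str(problem_type).strip().lower() == "mcq"
    # Field absent: keep the old behaviour.
    return bool(_OPTIONS_HEADER.search(str(task.get("input", ""))))
```

The explicit field wins even when the prompt happens to contain an `Options:` line, which is the case the heuristic gets wrong.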
Code quote:
is_mcq
src/eval_stages/stage1_local_eval_execution.py line 108 at r2 (raw file):
if is_mcq:
    answer_instruction = (
        "\n\nReason briefly and do not repeat yourself. Stop immediately after the final "
Why should we ask the model to do this?
Code quote:
Reason briefly and do not repeat yourself.
src/eval_stages/stage1_local_eval_execution.py line 116 at r2 (raw file):
else:
    answer_instruction = (
        "\n\nReason briefly and do not repeat yourself. Stop immediately after the final "
Same comment as above.
Code quote:
Reason briefly and do not repeat yourself.
src/eval_stages/stage1_local_eval_execution.py line 345 at r2 (raw file):
if not text:
    return None
m = re.search(r"[-+]?\d[\d,]*(?:\.\d+)?", text)
This doesn't seem to match .5 (it picks up only the 5 and drops the decimal point).
Can we replace it with:
m = re.search(r"[-+]?(?:\d[\d,]*(?:\.\d+)?|\.\d+)", text)
Code quote:
m = re.search(r"[-+]?\d[\d,]*(?:\.\d+)?", text)
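The difference between the two patterns can be checked directly. This is a small sketch using only the standard `re` module; the helper `first_number` and the constant names are hypothetical, and the second pattern is written with the dots escaped (`\.`), which is assumed to be what the suggestion intends.

```python
import re
from typing import Optional, Pattern

# Pattern currently in the code: requires a digit before any decimal point.
ORIGINAL = re.compile(r"[-+]?\d[\d,]*(?:\.\d+)?")
# Suggested pattern: also accepts a bare leading decimal point such as ".5".
PROPOSED = re.compile(r"[-+]?(?:\d[\d,]*(?:\.\d+)?|\.\d+)")


def first_number(pattern: Pattern[str], text: str) -> Optional[str]:
    """Return the first substring matched by pattern, or None if nothing matches."""
    m = pattern.search(text)
    return m.group(0) if m else None


print(first_number(ORIGINAL, ".5"))        # 5   (decimal point lost)
print(first_number(PROPOSED, ".5"))        # .5
print(first_number(ORIGINAL, "-.5"))       # 5   (sign lost as well)
print(first_number(PROPOSED, "-.5"))       # -.5
print(first_number(PROPOSED, "1,234.56"))  # 1,234.56
```

Note that the original pattern does not fail outright on `.5`; it silently extracts `5`, which would turn an answer of 0.5 into 5.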
afkanpour
left a comment
@afkanpour reviewed 1 file and resolved 3 discussions.
Reviewable status: 2 of 33 files reviewed, 5 unresolved discussions (waiting on Negiiiin and saidul-islam98).
afkanpour
left a comment
@afkanpour reviewed 31 files and all commit messages, and made 1 comment.
Reviewable status: all files reviewed, 5 unresolved discussions (waiting on Negiiiin and saidul-islam98).
PR Type
Feature
Short Description
This PR runs benchmark evaluation on both static benchmarks and our own generated benchmarks, using open-source models.