Summary
RecipeDeleteMultipleRecipesWithConstraint appears to build its target rows with a case-sensitivity bug. The code lowercases each recipe's directions but does not lowercase the selected ingredient. When the selected ingredient is mixed-case, for example Parmesan, recipes whose directions contain Parmesan cheese are not selected as targets and are instead treated as noise rows.
This can make the official target set empty (row_objects: []). The evaluator then checks that the empty target set was deleted, so an agent can delete nothing and still receive success: 1.0.
Expected behavior
For a task with this goal:
Delete the recipes from Broccoli app that use Parmesan in the directions.
any recipe whose directions contain Parmesan or Parmesan cheese should be included in row_objects, and deleting it should be required for success.
Actual behavior in the attached run
The retained evaluator input has:
"ingredient": "Parmesan",
"row_objects": []
But the same native_evaluator_input.json includes multiple initialized recipe rows whose directions contain Parmesan cheese, for example:
Chicken Alfredo Pasta: Serve with a sprinkle of Parmesan cheese.
Pesto Pasta with Peas: Add Parmesan cheese before serving.
Chicken Caesar Salad Wrap: ... grilled chicken strips, and Parmesan cheese.
Another Chicken Caesar Salad Wrap variant with the same Parmesan-containing base direction.
The agent searched for Parmesan, opened two Eggplant Parmesan entries, decided their directions did not mention Parmesan, and ended without deleting any recipe. Despite that, the native evaluator output reports success: 1.0.
Likely root cause
In RecipeDeleteMultipleRecipesWithConstraint.generate_random_params, the filters compare ingredient against r.directions.lower():
filter_fn=lambda r: ingredient not in r.directions.lower()
...
filter_fn=lambda r: ingredient in r.directions.lower()
This is not a case-insensitive comparison unless ingredient is already lowercase. _COMMON_INGREDIENTS contains mixed-case values such as Parmesan cheese and Parmesan, so:
"Parmesan" in "serve with a sprinkle of parmesan cheese"
is false.
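A minimal reproduction of the mismatch, using one of the directions strings quoted above. The variable names are illustrative and are not taken from recipe.py.

```python
# Illustrative reproduction; names are hypothetical, not from recipe.py.
ingredient = "Parmesan"
directions = "Serve with a sprinkle of Parmesan cheese."

# Current comparison: only the directions side is lowercased.
buggy_match = ingredient in directions.lower()

# Case-insensitive comparison: both sides are lowercased.
fixed_match = ingredient.lower() in directions.lower()

print(buggy_match, fixed_match)
```

The first comparison fails only because of the capital P in the ingredient. Lowercasing both operands makes the same recipe match.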
As a result, real matching recipes are excluded from row_objects. In this run, the official evaluator checked deletion of an empty target set rather than checking the task's stated condition.
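The empty-target pass can be illustrated with a simplified stand-in for a deletion check. This sketch assumes the validator requires every target row to be absent and every other row to remain. The function below is hypothetical and is not the code in sqlite_validators.py.

```python
# Hypothetical stand-in for a deletion check; not the actual validator.
def deletion_succeeded(before, after, targets):
    """True if every target is gone and every non-target row is still present."""
    targets_gone = all(row not in after for row in targets)
    expected_remaining = [row for row in before if row not in targets]
    others_kept = sorted(expected_remaining) == sorted(after)
    return targets_gone and others_kept


rows = ["Chicken Alfredo Pasta", "Pesto Pasta with Peas", "Eggplant Parmesan"]

# With an empty target set, an agent that deletes nothing passes the check.
no_op_result = deletion_succeeded(before=rows, after=list(rows), targets=[])
print(no_op_result)
```

Under that assumption, a check over an empty target list is vacuously satisfied, so the unchanged database counts as a successful deletion.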
Suggested fix
Normalize both sides of the comparison:
ingredient_lower = ingredient.lower()
noise = sqlite_schema_utils.get_random_items(
    cls.n_rows_noise,
    _generate_random_recipe,
    replacement=False,
    filter_fn=lambda r: ingredient_lower not in r.directions.lower(),
)
targets = sqlite_schema_utils.get_random_items(
    n_rows,
    _generate_random_recipe,
    replacement=False,
    filter_fn=lambda r: ingredient_lower in r.directions.lower(),
)
It may also be worth rejecting or regenerating task instances where row_objects is empty, since this task class sets n_rows = 3 and the user-facing goal implies there are recipes to delete.
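One possible shape for such a guard, as a sketch only. The helper name ensure_targets and the generate callable are hypothetical. The sketch also assumes the generated parameters are a dict with a row_objects key, as the evaluator input suggests.

```python
# Hypothetical guard; `generate` stands in for the task's parameter generation.
def ensure_targets(generate, max_attempts=10):
    """Call generate() until it returns params with a non-empty row_objects."""
    for _ in range(max_attempts):
        params = generate()
        if params.get("row_objects"):
            return params
    raise ValueError("could not generate a task instance with target rows")


# Usage: the first attempt has no targets, so the second one is returned.
attempts = iter([{"row_objects": []}, {"row_objects": ["recipe_1"]}])
params = ensure_targets(lambda: next(attempts))
print(params)
```

Raising after a bounded number of attempts keeps a misconfigured ingredient list from looping forever and surfaces the problem at task-generation time rather than at evaluation time.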
Files in the evidence package
evidence/native_evaluator_input.json: shows ingredient: "Parmesan", row_objects: [], and the Parmesan-containing noise rows.
evidence/native_evaluator_output.json: shows success: 1.0.
evidence/trajectory_steps.json: shows the agent searched/opened recipes and ended without deletion.
source/recipe.py: local copy of the relevant AndroidWorld task source.
source/sqlite_validators.py: local copy of the deletion validator used by the task.
RecipeDeleteMultipleRecipesWithConstraint_agent_a_parmesan_issue.tar.gz