RecipeDeleteMultipleRecipesWithConstraint generates empty targets for mixed-case ingredients #365

@gss10282023

Summary

RecipeDeleteMultipleRecipesWithConstraint appears to have a case-sensitivity bug when it builds the target rows. The code lowercases each recipe's directions but does not lowercase the selected ingredient. When the selected ingredient is mixed-case, for example Parmesan, recipes whose directions contain Parmesan cheese are not selected as targets and are treated as noise rows instead.

This can make the official target set empty (row_objects: []). The evaluator then checks that every row in the empty target set was deleted, which is trivially true, so an agent can delete nothing and still receive success: 1.0.
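To illustrate why an empty target set passes, here is a minimal sketch. It assumes the deletion check reduces to "every target row is gone and every other row is still present"; `deletion_succeeded` is a hypothetical stand-in written for this issue, not the code in sqlite_validators.py.

```python
def deletion_succeeded(before, after, targets):
    """Hypothetical stand-in for the deletion check (not the real validator)."""
    # Every target row must be absent from the final table.
    targets_gone = all(t not in after for t in targets)
    # Every non-target row must still be present.
    others_kept = all(r in after for r in before if r not in targets)
    return targets_gone and others_kept


rows = ["Chicken Alfredo Pasta", "Pesto Pasta with Peas"]

# The agent deleted nothing, so the table is unchanged.
# With no targets, all([]) is True and the check passes vacuously.
print(deletion_succeeded(rows, rows, targets=[]))  # True
```

With a non-empty target list, the same unchanged table would fail the check, which is the behavior the task goal implies.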

Expected behavior

For a task with this goal:

Delete the recipes from Broccoli app that use Parmesan in the directions.

any recipe whose directions contain Parmesan or Parmesan cheese should be included in row_objects, and success should require deleting it.

Actual behavior in the attached run

The retained evaluator input has:

"ingredient": "Parmesan",
"row_objects": []

But the same native_evaluator_input.json includes multiple initialized recipe rows whose directions contain Parmesan cheese, for example:

  • Chicken Alfredo Pasta: Serve with a sprinkle of Parmesan cheese.
  • Pesto Pasta with Peas: Add Parmesan cheese before serving.
  • Chicken Caesar Salad Wrap: ... grilled chicken strips, and Parmesan cheese.
  • another Chicken Caesar Salad Wrap variant with the same Parmesan-containing base direction.

The agent searched for Parmesan, opened two Eggplant Parmesan entries, decided their directions did not mention Parmesan, and ended without deleting any recipe. Despite that, the native evaluator output reports:

"success": 1.0

Likely root cause

In RecipeDeleteMultipleRecipesWithConstraint.generate_random_params, the filters compare ingredient against r.directions.lower():

filter_fn=lambda r: ingredient not in r.directions.lower()
...
filter_fn=lambda r: ingredient in r.directions.lower()

This is not a case-insensitive comparison unless ingredient is already lowercase. _COMMON_INGREDIENTS contains mixed-case values such as Parmesan cheese and Parmesan, so:

"Parmesan" in "serve with a sprinkle of parmesan cheese"

evaluates to False.

As a result, real matching recipes are excluded from row_objects. In this run, the official evaluator checked deletion of an empty target set rather than checking the task's stated condition.
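The misclassification can be reproduced outside the task code. This is a small sketch that applies the two filter expressions quoted above to two of the directions strings from the attached run; the list and variable names are only for illustration.

```python
# Two directions strings taken from the noise rows in the attached run.
sample_directions = [
    "Serve with a sprinkle of Parmesan cheese.",
    "Add Parmesan cheese before serving.",
]
ingredient = "Parmesan"  # mixed-case, as in the retained evaluator input

# Same expressions as the current filters: only the directions are lowercased.
buggy_targets = [d for d in sample_directions if ingredient in d.lower()]
buggy_noise = [d for d in sample_directions if ingredient not in d.lower()]

print(len(buggy_targets))  # 0
print(len(buggy_noise))    # 2
```

Both strings land in the noise partition even though they mention the ingredient.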

Suggested fix

Normalize both sides of the comparison:

ingredient_lower = ingredient.lower()

noise = sqlite_schema_utils.get_random_items(
    cls.n_rows_noise,
    _generate_random_recipe,
    replacement=False,
    filter_fn=lambda r: ingredient_lower not in r.directions.lower(),
)

targets = sqlite_schema_utils.get_random_items(
    n_rows,
    _generate_random_recipe,
    replacement=False,
    filter_fn=lambda r: ingredient_lower in r.directions.lower(),
)

It may also be worth rejecting or regenerating task instances where row_objects is empty, since this task class sets n_rows = 3 and the user-facing goal implies there are recipes to delete.
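One way to do that is a guard around parameter generation. The sketch below is only an assumption about how it could look: `require_targets` and `generate_with_retry` are hypothetical helpers, and the only thing taken from the task is that the params dict carries a row_objects list.

```python
def require_targets(params, expected_rows):
    """Raise if the generated params carry fewer target rows than expected."""
    targets = params.get("row_objects", [])
    if len(targets) < expected_rows:
        raise ValueError(
            f"expected {expected_rows} target rows, got {len(targets)}"
        )
    return params


def generate_with_retry(generate, expected_rows, max_attempts=10):
    """Call a zero-argument generator until it yields enough target rows."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return require_targets(generate(), expected_rows)
        except ValueError as err:
            last_error = err  # remember why this attempt was rejected
    raise RuntimeError(f"no valid task instance after {max_attempts} attempts") from last_error
```

Here `generate` would be something like a wrapper around generate_random_params, and `expected_rows` would be the task's n_rows. Raising instead of silently returning keeps an empty target set from ever reaching the evaluator.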

Files in the evidence package

  • evidence/native_evaluator_input.json: shows ingredient: "Parmesan", row_objects: [], and the Parmesan-containing noise rows.
  • evidence/native_evaluator_output.json: shows success: 1.0.
  • evidence/trajectory_steps.json: shows the agent searched/opened recipes and ended without deletion.
  • source/recipe.py: local copy of the relevant AndroidWorld task source.
  • source/sqlite_validators.py: local copy of the deletion validator used by the task.

RecipeDeleteMultipleRecipesWithConstraint_agent_a_parmesan_issue.tar.gz
