fix: correct ground_truth comparison in BanditEnv #190

Open
bigsawman wants to merge 1 commit into TextArena:main from bigsawman:fix/bandit-ground-truth-comparison

Conversation

@bigsawman

Summary

  • ground_truth in BanditEnv is a dict mapping button names to probabilities (e.g. {"red": 0.6, "blue": 0.3, ...}), but the final-turn winner check compares a button string directly against this dict (button == self.state.game_state['ground_truth']), which always evaluates to False.
  • This means the player can never be recognized as having chosen the correct button — every game ends with an incorrect outcome and a regret-based reward.
  • Fix: find the button with the highest probability via max(..., key=...), then compare the player's choice against that.
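
The fix described above can be sketched in isolation as follows. This is a minimal illustration, not the actual diff: `picked_best_button` is a hypothetical helper name, and `ground_truth` stands in for `self.state.game_state['ground_truth']`.

```python
from typing import Dict


def picked_best_button(button: str, ground_truth: Dict[str, float]) -> bool:
    """Return True if `button` is the highest-probability button.

    Hypothetical helper for illustration; the PR applies the same idea
    inline in the final-turn check.
    """
    # max() over a dict iterates its keys; key=ground_truth.get ranks each
    # key by its probability. On ties, the first such key in insertion
    # order is returned.
    best_button = max(ground_truth, key=ground_truth.get)
    return button == best_button


ground_truth = {"red": 0.6, "blue": 0.3}
print(picked_best_button("red", ground_truth))   # True
print(picked_best_button("blue", ground_truth))  # False
```

Note that if two buttons share the top probability, only the first one in the dict counts as correct under this sketch; whether that matters depends on how BanditEnv generates its probabilities.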

Reproduction

ground_truth = {"red": 0.6, "blue": 0.3}
button = "red"
print(button == ground_truth)  # False: a str never compares equal to a dict

Test plan

  • Verified that _regret() already treats ground_truth as a dict (it calls .values() and indexes by button name), confirming that ground_truth is a dict, not a string.
  • Confirmed the fix matches the intended semantics: reward 1.0 when the player picks the highest-probability button.
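
The checks above could be written down roughly like this. The `regret` function here is an assumption about what `_regret()` computes (best probability minus the chosen button's probability), inferred only from the description that it calls `.values()` and indexes by button name; it is not copied from the BanditEnv source.

```python
from typing import Dict


def regret(button: str, ground_truth: Dict[str, float]) -> float:
    # Assumed form of _regret(): gap between the best achievable
    # probability and the probability of the chosen button.
    return max(ground_truth.values()) - ground_truth[button]


ground_truth = {"red": 0.6, "blue": 0.3, "green": 0.1}
best_button = max(ground_truth, key=ground_truth.get)

for button in ground_truth:
    is_best = button == best_button
    # With no ties, the fixed check agrees with zero regret.
    assert is_best == (regret(button, ground_truth) == 0.0)
    # The old check compared a str against the whole dict, so it never held.
    assert (button == ground_truth) is False

print(best_button)  # red
```

Under that assumption, the corrected comparison and the regret calculation agree on which choice is optimal, which is the consistency the test plan relies on.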

`ground_truth` is a dict mapping button names to probabilities, but
the final-turn check compared a button string directly against this
dict (`button == self.state.game_state['ground_truth']`), which always
evaluates to False. This means the player can never win.

Fix: find the button with the highest probability first, then compare.