An ML research task that evaluates an LLM's ability to calculate poker equity (win probability) in Texas Hold'em scenarios. The task tests the model's understanding of poker hand rankings, combinatorics, and probabilistic reasoning.
The LLM is given:
- Hero's hole cards (2 cards)
- Community board (5 cards - flop, turn, river)
- Opponent's range description (e.g., "pocket pair", "suited cards", "broadway cards")
The LLM must calculate the hero's equity (win probability) against the opponent's range and output a single number (0-100).
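To make the idea of an opponent range concrete, here is a minimal sketch of how a range description could be expanded into concrete two-card combinations. The predicates in `RANGE_FILTERS` and the helper `expand_range` are hypothetical illustrations, not the definitions used in `poker_utils.py`; in particular, "broadway cards" is assumed to mean both cards are ten or higher.

```python
from itertools import combinations

RANKS = "23456789TJQKA"
SUITS = "cdhs"
DECK = [r + s for r in RANKS for s in SUITS]
BROADWAY = set("TJQKA")

# Hypothetical predicates over two cards such as "Ah" and "Kd".
# These are assumptions for illustration, not the repo's definitions.
RANGE_FILTERS = {
    "any two cards": lambda a, b: True,
    "pocket pair": lambda a, b: a[0] == b[0],
    "suited cards": lambda a, b: a[1] == b[1],
    "broadway cards": lambda a, b: a[0] in BROADWAY and b[0] in BROADWAY,
}


def expand_range(description, dead_cards):
    """Return every two-card combo that matches the description and uses no dead card."""
    keep = RANGE_FILTERS[description]
    dead = set(dead_cards)
    live = [card for card in DECK if card not in dead]
    return [(a, b) for a, b in combinations(live, 2) if keep(a, b)]


hero = ["Ah", "Kd"]
board = ["Ks", "7h", "2c", "Qd", "9s"]
for name in RANGE_FILTERS:
    print(name, len(expand_range(name, hero + board)))
```

The hero's cards and the board are removed first, so a narrower range leaves fewer combinations to weigh.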
This task tests several capabilities:
- Domain knowledge: Understanding poker hand rankings and Texas Hold'em rules
- Probabilistic reasoning: Estimating win rates against a range of hands
- Combinatorics: Counting the hand combinations in the opponent's range and which of them beat the hero
- Instruction following: Outputting in the exact format requested
- Numerical reasoning: Providing accurate percentage estimates
pip install treys anthropic

- prompt.txt - The base prompt template
- poker_utils.py - Poker hand generation and equity calculation utilities
- grader.py - Grading logic that checks LLM responses
- test_runner.py - Framework for running multiple tests
- run_claude_haiku.py - Example integration with Claude Haiku 4.5
from grader import get_prompt, grade
# Generate a prompt
prompt, scenario = get_prompt()
print(prompt)
# Get model response (replace with your model)
response = your_model(prompt)
# Grade the response
passed, metadata = grade(response, scenario)
print(f"Passed: {passed}")
print(f"Error: {metadata['error']} percentage points")

from test_runner import run_multiple_tests, print_summary

def your_model(prompt: str) -> str:
    # Your model implementation
    return "45"  # Example

summary = run_multiple_tests(your_model, num_tests=20)
print_summary(summary)

# Set API key (PowerShell syntax)
$env:ANTHROPIC_API_KEY = "your-api-key"
# Run tests
python run_claude_haiku.py

The grader accepts responses if:
- The LLM outputs a number between 0 and 100
- That number is within 15 percentage points of the correct equity
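The two acceptance rules can be sketched as a small function. This is an illustration under assumptions, not the code in `grader.py`: the name `grade_sketch`, the strict "bare number only" parsing, and the `reason` field are all hypothetical, and the real grader may parse responses more leniently.

```python
import re

TOLERANCE = 15.0  # percentage points, as stated in this README


def grade_sketch(response: str, true_equity: float, tolerance: float = TOLERANCE):
    """Return (passed, metadata) for a raw model response (hypothetical helper)."""
    text = response.strip()
    # Assumption: only a bare number is accepted; explanations are format errors.
    if not re.fullmatch(r"\d+(\.\d+)?", text):
        return False, {"error": None, "reason": "not a single number"}
    value = float(text)
    if not 0.0 <= value <= 100.0:
        return False, {"error": None, "reason": "out of range"}
    # Absolute error in percentage points against the reference equity.
    error = abs(value - true_equity)
    return error <= tolerance, {"error": error, "reason": "ok"}


print(grade_sketch("70", 85.0))
print(grade_sketch("I think about 85", 85.0))
```

An answer exactly 15 points away is treated as passing here; whether the boundary is inclusive in the real grader is not stated.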
The 15-point tolerance accounts for:
- Monte Carlo simulation variance (~5-10 points)
- Reasonable approximations in equity calculation
- LLM's approximate reasoning about hand strength
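For a sense of where a 5-10 point simulation variance could come from, the sketch below uses the usual binomial standard error, treating each simulated showdown as an independent win/lose trial. That modelling choice and the function name `standard_error_points` are assumptions; the sample count used by `poker_utils.py` is not stated here.

```python
import math


def standard_error_points(p: float, n_samples: int) -> float:
    """Standard error, in percentage points, of a Monte Carlo estimate of equity p."""
    return 100.0 * math.sqrt(p * (1.0 - p) / n_samples)


# p = 0.5 is the worst case; the error shrinks with the square root of n.
for n in (25, 100, 1000, 10000):
    print(n, round(standard_error_points(0.5, n), 2))
```

Under this model, a 5-10 point standard error corresponds to roughly 25 to 100 samples per scenario, and raising the sample count would allow a tighter tolerance.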
To pass a scenario (the task is tuned for a 10-40% pass rate), the model:
- Needs basic poker knowledge
- Must understand hand rankings
- Should reason about relative hand strength
- Needs to estimate probabilities approximately
Common failure modes:
- No poker knowledge: Model doesn't understand hand rankings (gives random numbers)
- Overconfident: Always predicts 100% or 0%
- Ignores opponent range: Doesn't adjust equity based on range description
- Format errors: Provides explanation instead of just a number
- Poor probability estimation: Significantly over/underestimates equity
Prompt:
Hero cards: ['Ah', 'Kd']
Board: ['Ks', '7h', '2c', 'Qd', '9s']
Opponent range: any two cards
Correct equity: ~85% (hero has top pair: kings with an ace kicker)
Acceptable responses: 70-100 (within 15 points)
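Because all five board cards are known, this example can also be checked by exact enumeration instead of simulation. The sketch below is a self-contained, pure-Python illustration and is not the method used in `poker_utils.py` (which relies on `treys`); the helpers `rank5`, `best7`, and `exact_equity` are hypothetical names, and ties are assumed to count as half a win.

```python
from collections import Counter
from itertools import combinations

RANKS = "23456789TJQKA"
DECK = [r + s for r in RANKS for s in "cdhs"]


def rank5(cards):
    """Score a 5-card hand as a tuple; larger tuples are stronger hands."""
    values = sorted((RANKS.index(c[0]) for c in cards), reverse=True)
    counts = Counter(values)
    # Ranks ordered by multiplicity first, then by rank (pairs before kickers).
    ordered = sorted(counts, key=lambda v: (counts[v], v), reverse=True)
    shape = sorted(counts.values(), reverse=True)
    flush = len({c[1] for c in cards}) == 1
    straight = len(counts) == 5 and values[0] - values[4] == 4
    if values == [12, 3, 2, 1, 0]:  # wheel: A-2-3-4-5 plays as five-high
        straight, ordered = True, [3, 2, 1, 0, -1]
    if straight and flush:
        category = 8
    elif shape == [4, 1]:
        category = 7
    elif shape == [3, 2]:
        category = 6
    elif flush:
        category = 5
    elif straight:
        category = 4
    elif shape == [3, 1, 1]:
        category = 3
    elif shape == [2, 2, 1]:
        category = 2
    elif shape == [2, 1, 1, 1]:
        category = 1
    else:
        category = 0
    return (category, ordered)


def best7(cards):
    """Best 5-card score obtainable from 7 cards."""
    return max(rank5(combo) for combo in combinations(cards, 5))


def exact_equity(hero, board, opp_combos):
    """Hero equity in percent against equally weighted opponent combos."""
    hero_score = best7(hero + board)
    total = 0.0
    for opp in opp_combos:
        opp_score = best7(list(opp) + board)
        if hero_score > opp_score:
            total += 1.0
        elif hero_score == opp_score:
            total += 0.5  # assumption: a tie is worth half a win
    return 100.0 * total / len(opp_combos)


hero = ["Ah", "Kd"]
board = ["Ks", "7h", "2c", "Qd", "9s"]
live = [c for c in DECK if c not in hero + board]
equity = exact_equity(hero, board, list(combinations(live, 2)))
print(f"{equity:.1f}")
```

Of the 990 possible opponent holdings, 110 beat the hero (pocket aces, sets, two pair, and jack-ten for a straight) and 6 tie (ace-king), giving 877/990, roughly 88.6%. That is a few points above the rounded ~85% quoted above and well inside the 15-point tolerance.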
To adjust difficulty:
- Change tolerance in grader.py (line 69)
- Modify range types in poker_utils.py (lines 38-44)
- Add more complex scenarios (multiple opponents, specific hand combinations)
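Before changing the tolerance, it can help to see how the pass rate would move. The sketch below assumes you have collected the per-scenario absolute errors (for example from `metadata['error']`) into a list; the helper `pass_rate` is hypothetical and the numbers in `example_errors` are made up purely for illustration, not measured results.

```python
def pass_rate(errors, tolerance):
    """Fraction of responses whose absolute error is within the tolerance."""
    return sum(e <= tolerance for e in errors) / len(errors)


# Made-up absolute errors in percentage points, for illustration only.
example_errors = [3.0, 8.5, 12.0, 14.9, 16.0, 22.0, 31.5, 40.0, 47.0, 55.0]

for tol in (5, 10, 15, 20):
    print(tol, pass_rate(example_errors, tol))
```

Responses that fail on format have no numeric error and would need to be counted as failures separately.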
Task properties:
- Pass rate: Designed for 10-40% with claude-haiku-4-5
- Concise: ~250 lines of code total
- Novel: Tests poker-specific reasoning and probability estimation
- Multiple failure modes: Knowledge gaps, probability estimation, format issues
- Verifiable: Reference equity computed by Monte Carlo simulation
- ML-relevant: Tests numerical reasoning and domain knowledge
MIT