LLM Poker Equity Evaluation Task

An ML research task that evaluates an LLM's ability to calculate poker equity (win probability) in Texas Hold'em scenarios. The task tests the model's understanding of poker hand rankings, combinatorics, and probabilistic reasoning.

Overview

The LLM is given:

Hero's hole cards (2 cards)
Community board (5 cards - flop, turn, river)
Opponent's range description (e.g., "pocket pair", "suited cards", "broadway cards")

The LLM must calculate the hero's equity (win probability) against the opponent's range and output a single number (0-100).
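To make the range input concrete, the sketch below counts the two-card holdings that match a range description once hero's cards and the board are removed from the deck. The predicate definitions (for example, treating "broadway" as ten through ace) and the names RANGE_FILTERS and range_combos are assumptions for illustration; poker_utils.py may define its ranges differently.

```python
from itertools import combinations

RANKS = "23456789TJQKA"
SUITS = "shdc"

# Hypothetical predicates for the range descriptions quoted above.
# Each takes two cards written as rank + suit, e.g. "Ah" and "Kd".
RANGE_FILTERS = {
    "pocket pair": lambda a, b: a[0] == b[0],
    "suited cards": lambda a, b: a[1] == b[1],
    "broadway cards": lambda a, b: a[0] in "TJQKA" and b[0] in "TJQKA",
    "any two cards": lambda a, b: True,
}


def range_combos(description, dead_cards):
    """List every two-card holding in the range that avoids the dead cards."""
    dead = set(dead_cards)
    deck = [r + s for r in RANKS for s in SUITS if r + s not in dead]
    keep = RANGE_FILTERS[description]
    return [pair for pair in combinations(deck, 2) if keep(*pair)]


if __name__ == "__main__":
    known = ["Ah", "Kd", "Ks", "7h", "2c", "Qd", "9s"]
    for name in RANGE_FILTERS:
        print(name, len(range_combos(name, known)))
```

Removing the seven known cards leaves 45 cards, so "any two cards" always contains 990 holdings; narrower ranges shrink that count and shift the equity accordingly.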

Why This Task?

This task tests several capabilities:

Domain knowledge: Understanding poker hand rankings and Texas Hold'em rules
Probabilistic reasoning: Estimating win rates against a range of hands
Combinatorics: Counting the holdings in the opponent's range and which of them beat hero
Instruction following: Outputting in the exact format requested
Numerical reasoning: Providing accurate percentage estimates

Installation

pip install treys anthropic

Files

prompt.txt - The base prompt template
poker_utils.py - Poker hand generation and equity calculation utilities
grader.py - Grading logic that checks LLM responses
test_runner.py - Framework for running multiple tests
run_claude_haiku.py - Example integration with Claude Haiku 4.5

Usage

Run a single test

from grader import get_prompt, grade

# Generate a prompt
prompt, scenario = get_prompt()
print(prompt)

# Get model response (replace with your model)
response = your_model(prompt)

# Grade the response
passed, metadata = grade(response, scenario)
print(f"Passed: {passed}")
print(f"Error: {metadata['error']} percentage points")

Run multiple tests

from test_runner import run_multiple_tests, print_summary

def your_model(prompt: str) -> str:
    # Your model implementation
    return "45"  # Example

summary = run_multiple_tests(your_model, num_tests=20)
print_summary(summary)

Test with Claude Haiku 4.5

# Set API key (PowerShell syntax; in bash or zsh use: export ANTHROPIC_API_KEY="your-api-key")
$env:ANTHROPIC_API_KEY = "your-api-key"

# Run tests
python run_claude_haiku.py

Grading

The grader accepts responses if:

The LLM outputs a number between 0-100
The error is within 15 percentage points of the correct equity
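The two acceptance conditions above can be sketched as follows. This is a minimal illustration, not the contents of grader.py: the function name grade_response, the strict single-number parsing, and the metadata keys are assumptions, and the real grader may be more lenient about surrounding text.

```python
import re

TOLERANCE = 15.0  # percentage points, as stated in this README


def grade_response(response, true_equity, tolerance=TOLERANCE):
    """Return (passed, metadata) for a raw model response (hypothetical helper)."""
    text = response.strip().rstrip("%").strip()
    # Assumption: the whole response must be one non-negative number.
    if not re.fullmatch(r"\d+(\.\d+)?", text):
        return False, {"error": None, "reason": "not a single number"}
    guess = float(text)
    if not 0 <= guess <= 100:
        return False, {"error": None, "reason": "outside 0-100"}
    error = abs(guess - true_equity)
    passed = error <= tolerance
    return passed, {"error": error, "reason": "ok" if passed else "too far off"}


if __name__ == "__main__":
    print(grade_response("45", 50.0))
    print(grade_response("I think about 45", 50.0))
```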

The 15-point tolerance accounts for:

Monte Carlo simulation variance (~5-10 points)
Reasonable approximations in equity calculation
LLM's approximate reasoning about hand strength
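The size of the Monte Carlo term depends on how many showdowns are sampled, which this README does not state. As a rough guide, and assuming each sampled showdown is an independent win/loss trial (ties ignored), the standard error of the estimate follows the usual binomial formula. The function name below is hypothetical.

```python
import math


def equity_standard_error(equity_pct, num_samples):
    """Standard error, in percentage points, of a sampled equity estimate.

    Assumes independent win/loss trials; ties are ignored for simplicity.
    """
    p = equity_pct / 100
    return 100 * math.sqrt(p * (1 - p) / num_samples)


if __name__ == "__main__":
    for n in (25, 100, 1000, 10000):
        print(n, round(equity_standard_error(50, n), 2))
```

Under these assumptions a noise level of several points corresponds to only tens or low hundreds of samples, so raising the sample count is one way to leave more of the tolerance for the model's own error.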

Expected Performance

The task is designed for a 10-40% pass rate. To pass, a model must:

Have basic poker knowledge
Understand hand rankings
Reason about relative hand strength
Estimate probabilities approximately

Common Failure Modes

No poker knowledge: Model doesn't understand hand rankings (gives random numbers)
Overconfident: Always predicts 100% or 0%
Ignores opponent range: Doesn't adjust equity based on range description
Format errors: Provides explanation instead of just a number
Poor probability estimation: Significantly over/underestimates equity

Example

Prompt:

Hero cards: ['Ah', 'Kd']
Board: ['Ks', '7h', '2c', 'Qd', '9s']
Opponent range: any two cards

Correct equity: ~89% (hero has top pair, kings with an ace kicker; only sets, two pair, a J-T straight, and pocket aces are ahead)

Acceptable responses: 74-100 (within 15 points)
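Because the board in this example is complete, the equity can be checked by enumerating every opponent holding instead of sampling. The sketch below uses a small hand-written evaluator rather than treys so that it runs with no dependencies. It assumes ties count as half a win, and the function names are illustrative, not taken from poker_utils.py.

```python
from collections import Counter
from itertools import combinations

RANKS = "23456789TJQKA"
SUITS = "shdc"


def rank5(cards):
    """Score a 5-card hand as a tuple; a larger tuple is a stronger hand."""
    counts = Counter(RANKS.index(c[0]) for c in cards)
    # Ranks ordered by multiplicity, then by rank (KK A Q 9 -> K, A, Q, 9).
    ordered = sorted(counts, key=lambda v: (counts[v], v), reverse=True)
    shape = tuple(sorted(counts.values(), reverse=True))
    flush = len({c[1] for c in cards}) == 1
    uniq = sorted(counts, reverse=True)
    straight_high = None
    if len(uniq) == 5 and uniq[0] - uniq[4] == 4:
        straight_high = uniq[0]
    elif uniq == [12, 3, 2, 1, 0]:  # A-2-3-4-5 plays as a five-high straight
        straight_high = 3
    if straight_high is not None:
        return (8 if flush else 4, [straight_high])
    if flush:
        return (5, ordered)
    category = {(4, 1): 7, (3, 2): 6, (3, 1, 1): 3, (2, 2, 1): 2,
                (2, 1, 1, 1): 1, (1, 1, 1, 1, 1): 0}[shape]
    return (category, ordered)


def best_rank(cards):
    """Best 5-card score among the 21 subsets of 7 cards."""
    return max(rank5(five) for five in combinations(cards, 5))


def exact_equity(hero, board):
    """Hero's equity in percent against any two cards on a complete board."""
    dead = set(hero) | set(board)
    deck = [r + s for r in RANKS for s in SUITS if r + s not in dead]
    hero_rank = best_rank(hero + board)
    wins = ties = total = 0
    for opp in combinations(deck, 2):
        opp_rank = best_rank(list(opp) + board)
        wins += hero_rank > opp_rank
        ties += hero_rank == opp_rank
        total += 1
    return 100 * (wins + ties / 2) / total  # assumption: a tie is half a win


if __name__ == "__main__":
    print(exact_equity(["Ah", "Kd"], ["Ks", "7h", "2c", "Qd", "9s"]))
```

For this hand the enumeration finds 874 wins and 6 ties (opponent A-K) out of 990 holdings, or about 88.6%, which agrees with the approximate figure quoted above and sits well inside the 15-point tolerance.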

Customization

To adjust difficulty:

Change tolerance in grader.py (line 69)
Modify range types in poker_utils.py (lines 38-44)
Add more complex scenarios (multiple opponents, specific hand combinations)

Task Characteristics

Pass rate: Designed for 10-40% with claude-haiku-4-5
Concise: ~250 lines of code total
Novel: Tests poker-specific reasoning and probability estimation
Multiple failure modes: Knowledge gaps, probability estimation, format issues
Verifiable: Reference equity computed via Monte Carlo simulation
ML-relevant: Tests numerical reasoning and domain knowledge

License

MIT
