peter-mw/llm_poker_eval

LLM Poker Equity Evaluation Task

An ML research task that evaluates an LLM's ability to calculate poker equity (win probability) in Texas Hold'em scenarios. The task tests the model's understanding of poker hand rankings, combinatorics, and probabilistic reasoning.

Overview

The LLM is given:

  • Hero's hole cards (2 cards)
  • Community board (5 cards: flop, turn, and river)
  • Opponent's range description (e.g., "pocket pair", "suited cards", "broadway cards")

The LLM must calculate the hero's equity (win probability) against the opponent's range and output a single number (0-100).
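As a rough illustration of what "against the opponent's range" means, the sketch below turns a range description into the concrete two-card combinations still available once hero's cards and the board are removed. The function name `opponent_combos`, the `RANGE_FILTERS` table, and the definition of "broadway" as ten through ace are assumptions made for this example; they are not taken from poker_utils.py.

```python
from itertools import combinations

RANKS = "23456789TJQKA"
SUITS = "cdhs"
BROADWAY = set("TJQKA")  # assumption: broadway means ten through ace

# Hypothetical mapping from a range description to a predicate on two cards.
RANGE_FILTERS = {
    "any two cards": lambda a, b: True,
    "pocket pair": lambda a, b: a[0] == b[0],
    "suited cards": lambda a, b: a[1] == b[1],
    "broadway cards": lambda a, b: a[0] in BROADWAY and b[0] in BROADWAY,
}

def opponent_combos(hero, board, range_name):
    """List every two-card holding in the range that uses no dead card."""
    dead = set(hero) | set(board)
    live = [r + s for r in RANKS for s in SUITS if r + s not in dead]
    keep = RANGE_FILTERS[range_name]
    return [(a, b) for a, b in combinations(live, 2) if keep(a, b)]

hero = ["Ah", "Kd"]
board = ["Ks", "7h", "2c", "Qd", "9s"]
for name in RANGE_FILTERS:
    print(name, len(opponent_combos(hero, board, name)))
```

Equity against a range is then the average result over these combinations, so a narrower range (for example, only pocket pairs) can shift the answer substantially even when hero's cards and the board stay the same.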

Why This Task?

This task tests several capabilities:

  1. Domain knowledge: Understanding poker hand rankings and Texas Hold'em rules
  2. Probabilistic reasoning: Estimating win rates against a range of hands
  3. Combinatorics: Counting the hand combinations in the opponent's range and working out which of them beat the hero
  4. Instruction following: Outputting in the exact format requested
  5. Numerical reasoning: Providing accurate percentage estimates

Installation

pip install treys anthropic

Files

  • prompt.txt - The base prompt template
  • poker_utils.py - Poker hand generation and equity calculation utilities
  • grader.py - Grading logic that checks LLM responses
  • test_runner.py - Framework for running multiple tests
  • run_claude_haiku.py - Example integration with Claude Haiku 4.5

Usage

Run a single test

from grader import get_prompt, grade

# Generate a prompt
prompt, scenario = get_prompt()
print(prompt)

# Get model response (replace with your model)
response = your_model(prompt)

# Grade the response
passed, metadata = grade(response, scenario)
print(f"Passed: {passed}")
print(f"Error: {metadata['error']} percentage points")

Run multiple tests

from test_runner import run_multiple_tests, print_summary

def your_model(prompt: str) -> str:
    # Your model implementation
    return "45"  # Example

summary = run_multiple_tests(your_model, num_tests=20)
print_summary(summary)

Test with Claude Haiku 4.5

# Set API key (PowerShell syntax; in bash or zsh use: export ANTHROPIC_API_KEY="your-api-key")
$env:ANTHROPIC_API_KEY = "your-api-key"

# Run tests
python run_claude_haiku.py

Grading

The grader accepts responses if:

  • The LLM outputs a number between 0 and 100
  • The number is within 15 percentage points of the correct equity
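The two acceptance rules can be sketched as a small function. This is only an illustration of the rules as stated here, not the code in grader.py: the name `grade_response`, its signature, and the choice to tolerate a trailing percent sign are assumptions.

```python
TOLERANCE = 15.0  # percentage points, as stated in this README

def grade_response(response, true_equity, tolerance=TOLERANCE):
    """Return (passed, metadata) for a raw model response string."""
    # Assumption: a trailing "%" is tolerated; any other extra text is rejected.
    text = response.strip().rstrip("%").strip()
    try:
        value = float(text)
    except ValueError:
        return False, {"error": None, "reason": "response is not a single number"}
    if not 0 <= value <= 100:
        return False, {"error": None, "reason": "number outside 0-100"}
    error = abs(value - true_equity)
    return error <= tolerance, {"error": error, "reason": "graded"}

print(grade_response("70", 85.0))
print(grade_response("Hero is ahead, about 80", 85.0))
```

A strict parser like this makes the "format errors" failure mode explicit: a response that wraps the number in an explanation fails even if the estimate itself is close.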

The 15-point tolerance accounts for:

  • Monte Carlo simulation variance (~5-10 points)
  • Reasonable approximations in equity calculation
  • LLM's approximate reasoning about hand strength
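To get a feel for the Monte Carlo term, the sketch below computes the standard error of an equity estimate under a simplifying assumption: each simulated showdown is an independent win-or-lose draw (ties ignored), so the estimate behaves like a binomial proportion. The function name `equity_standard_error` is hypothetical, and the sample counts are illustrative rather than the values used in poker_utils.py.

```python
import math

def equity_standard_error(equity_pct, num_samples):
    """Standard error, in percentage points, of a sampled equity estimate."""
    p = equity_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / num_samples)

# Illustrative sample counts only; the repository's actual count is not stated here.
for n in (25, 100, 1000, 10000):
    print(n, round(equity_standard_error(50.0, n), 2))
```

Under this assumption, an error of 5 to 10 points corresponds to only tens to around a hundred samples at 50% equity, so increasing the sample count is one way to tighten the reference value if the tolerance is lowered.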

Expected Performance

The task targets a 10-40% pass rate. To pass a given scenario, a model must:

  • Have basic poker knowledge
  • Understand hand rankings
  • Reason about relative hand strength
  • Estimate probabilities approximately

Common Failure Modes

  1. No poker knowledge: Model doesn't understand hand rankings (gives random numbers)
  2. Overconfident: Always predicts 100% or 0%
  3. Ignores opponent range: Doesn't adjust equity based on range description
  4. Format errors: Provides explanation instead of just a number
  5. Poor probability estimation: Significantly over/underestimates equity

Example

Prompt:

Hero cards: ['Ah', 'Kd']
Board: ['Ks', '7h', '2c', 'Qd', '9s']
Opponent range: any two cards

Correct equity: ~85% (hero has top pair: kings with an ace kicker)

Acceptable responses: 70-100 (within 15 points)
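Because all five board cards are known, this example can also be checked by enumerating every possible opponent holding instead of sampling. The sketch below does that with a small pure-Python evaluator written for this illustration; `rank5`, `best7`, and `exact_equity` are hypothetical names, ties are counted as half a win, and none of this is the repository's treys-based implementation.

```python
from collections import Counter
from itertools import combinations

RANKS = "23456789TJQKA"
SUITS = "cdhs"

def rank5(cards):
    """Score a 5-card hand as a tuple; a larger tuple is a stronger hand."""
    values = sorted((RANKS.index(c[0]) for c in cards), reverse=True)
    counts = Counter(values)
    # Ranks ordered by multiplicity, then value (pair rank before kickers).
    ordered = sorted(counts, key=lambda v: (counts[v], v), reverse=True)
    shape = sorted(counts.values(), reverse=True)
    flush = len({c[1] for c in cards}) == 1
    straight = len(counts) == 5 and values[0] - values[4] == 4
    if values == [12, 3, 2, 1, 0]:  # wheel: A-2-3-4-5 plays as five-high
        straight, ordered = True, [3, 2, 1, 0, -1]
    if straight and flush:
        category = 8
    elif shape == [4, 1]:
        category = 7
    elif shape == [3, 2]:
        category = 6
    elif flush:
        category = 5
    elif straight:
        category = 4
    elif shape == [3, 1, 1]:
        category = 3
    elif shape == [2, 2, 1]:
        category = 2
    elif shape == [2, 1, 1, 1]:
        category = 1
    else:
        category = 0
    return (category, ordered)

def best7(cards):
    """Best 5-card score obtainable from 7 cards."""
    return max(rank5(hand) for hand in combinations(cards, 5))

def exact_equity(hero, board):
    """Equity in percent against any two cards, counting a tie as half a win."""
    dead = set(hero) | set(board)
    live = [r + s for r in RANKS for s in SUITS if r + s not in dead]
    hero_score = best7(hero + board)
    points, total = 0.0, 0
    for opp in combinations(live, 2):
        opp_score = best7(list(opp) + board)
        points += 1.0 if hero_score > opp_score else 0.5 if hero_score == opp_score else 0.0
        total += 1
    return 100.0 * points / total

equity = exact_equity(["Ah", "Kd"], ["Ks", "7h", "2c", "Qd", "9s"])
print(f"{equity:.1f}")
```

For this hand the enumeration gives a value in the high 80s, within a few points of the ~85% quoted above and well inside the 70-100 acceptance window.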

Customization

To adjust difficulty:

  • Change tolerance in grader.py (line 69)
  • Modify range types in poker_utils.py (lines 38-44)
  • Add more complex scenarios (multiple opponents, specific hand combinations)

Task Characteristics

  • Pass rate: Designed for 10-40% with claude-haiku-4-5
  • Concise: ~250 lines of code total
  • Novel: Tests poker-specific reasoning and probability estimation
  • Multiple failure modes: Knowledge gaps, probability estimation, format issues
  • Verifiable: Reference equity computed via Monte Carlo simulation
  • ML-relevant: Tests numerical reasoning and domain knowledge

License

MIT
