Schnarps

Python-first implementation of the Schnarps project plan:

  • Rules-complete game engine (new_game, legal_actions, step)
  • CLI play modes (play-human-vs-bot, bot-vs-bot, replay)
  • Baseline bot policies (random + heuristic)
  • RL environment wrapper + masked PPO trainer scaffold
  • JSONL logging + Parquet export pipeline
  • Advisor API scaffold for UI integration
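
The "masked PPO" item refers to restricting the policy to legal actions. A minimal sketch of that idea in plain Python, assuming the policy emits one logit per action and that the legal action indices come from something like legal_actions. The function name masked_softmax and the data layout are illustrative and not taken from the repository:

```python
import math

def masked_softmax(logits, legal):
    """Softmax over legal action indices only; illegal actions get probability 0."""
    if not legal:
        raise ValueError("at least one legal action is required")
    # Subtract the largest legal logit for numerical stability.
    peak = max(logits[i] for i in legal)
    exps = {i: math.exp(logits[i] - peak) for i in legal}
    total = sum(exps.values())
    return [exps[i] / total if i in exps else 0.0 for i in range(len(logits))]

# Hypothetical 4-action example where only actions 1 and 3 are legal.
probs = masked_softmax([2.0, 0.5, -1.0, 0.5], legal={1, 3})
```

Masking before normalization keeps the sampled action legal and keeps the log-probabilities used in the PPO ratio consistent with what was actually sampled.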

Quickstart

python -m venv .venv
source .venv/bin/activate
pip install -e '.[test,data,rl]'
pytest

Conda alternative:

conda activate schnarps
pip install -e '.[test,data,rl]'
pytest

Run CLI:

schnarps play-human-vs-bot --seed 1 --human-player 0
schnarps bot-vs-bot --games 10 --seed 10 --log data/games.jsonl
schnarps replay data/games.jsonl
schnarps-export data/games.jsonl data/games.parquet

Run simulator script:

python scripts/simulate.py --games 1000 --output data/games.jsonl

Phase-2 pilot generation from trained checkpoint:

python scripts/generate_phase2_pilot.py \
  --checkpoint experiments/ppo_sweep_2026-02-26/cfg3_best.pt \
  --games 5000 \
  --output data/phase2/pilot_games.jsonl

Build canonical Phase-2 events parquet:

python scripts/build_phase2_canonical.py \
  --inputs data/phase2/pilot_games_concurrent.parquet data/phase2/pilot_games_batch2.parquet \
  --output data/phase2/canonical_events.parquet

Extract first-pass supervised bid/move datasets:

python scripts/extract_phase2_supervised.py \
  --inputs data/phase2/pilot_games_concurrent.jsonl data/phase2/pilot_games_batch2.jsonl \
  --output-dir data/phase2/supervised \
  --max-games 500

Full-scale sharded extraction:

python scripts/extract_phase2_supervised.py \
  --inputs data/phase2/pilot_games_concurrent.jsonl data/phase2/pilot_games_batch2.jsonl \
  --output-dir data/phase2/supervised_full \
  --max-games 0 \
  --shard-rows 250000 \
  --progress-every 500

Train PPO baseline:

python scripts/train_ppo.py --updates 20 --steps-per-update 2048 --checkpoint checkpoints/ppo_latest.pt

Phase-scoped PPO training for separate checkpoints:

python scripts/train_ppo.py --policy-scope in_out --checkpoint checkpoints/ppo_in_out_latest.pt
python scripts/train_ppo.py --policy-scope play --checkpoint checkpoints/ppo_play_latest.pt

By default, training also tracks and saves the best eval checkpoint at checkpoints/ppo_best.pt. Disable with --no-track-best.

If you change the evaluation regime substantially (for example, switching from full-policy eval semantics to scoped phase-only eval), write to a fresh --best-checkpoint path so that a best-score threshold from the old regime does not carry over.

Seat randomization (recommended):

python scripts/train_ppo.py --seat-mode random --learner-seats 0,1,2,3,4

Deterministic seat balancing (useful for stability diagnostics):

python scripts/train_ppo.py --seat-mode cycle --learner-seats 0,1,2,3,4
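
The difference between the two seat modes can be sketched as follows. This assumes random draws a learner seat uniformly per episode and cycle walks the seat list in order; the helper seat_schedule is hypothetical and the real script may assign seats at a different granularity:

```python
import itertools
import random

def seat_schedule(mode, seats, episodes, seed=0):
    """Return the learner seat for each episode under the given mode."""
    if mode == "cycle":
        cycler = itertools.cycle(seats)
        return [next(cycler) for _ in range(episodes)]
    if mode == "random":
        rng = random.Random(seed)  # seeded so the plan is reproducible
        return [rng.choice(seats) for _ in range(episodes)]
    raise ValueError(f"unknown seat mode: {mode}")

seats = [0, 1, 2, 3, 4]
cycle_plan = seat_schedule("cycle", seats, episodes=10)
random_plan = seat_schedule("random", seats, episodes=10, seed=1)
```

Cycling guarantees equal seat exposure over any multiple of five episodes, which is why it suits stability diagnostics; random sampling only balances seats in expectation.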

Phase-head PPO training (new track):

python scripts/train_ppo.py \
  --model-variant phase_heads \
  --seat-mode random \
  --learner-seats 0,1,2,3,4 \
  --opponent-mix heuristic:0.6,random:0.2,self:0.2 \
  --entropy-coef 0.01 \
  --entropy-coef-bid 0.02 \
  --entropy-coef-play 0.005 \
  --eval-stochastic-episodes 20 \
  --hand-strength-signal-coef 0.05
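
The --opponent-mix value is a comma-separated list of name:weight pairs. A minimal sketch of how such a spec could be parsed and sampled, assuming the weights are probabilities that must sum to 1 (the actual script may normalize instead); parse_opponent_mix and sample_opponent are illustrative names:

```python
import random

def parse_opponent_mix(spec):
    """Parse 'name:weight,...' into a dict and check the weights sum to 1."""
    mix = {}
    for item in spec.split(","):
        name, weight = item.split(":")
        mix[name.strip()] = float(weight)
    total = sum(mix.values())
    if abs(total - 1.0) > 1e-9:  # assumption: weights are probabilities
        raise ValueError(f"weights must sum to 1, got {total}")
    return mix

def sample_opponent(mix, rng):
    """Draw one opponent type according to the mix weights."""
    names = list(mix)
    return rng.choices(names, weights=[mix[n] for n in names], k=1)[0]

mix = parse_opponent_mix("heuristic:0.6,random:0.2,self:0.2")
opponent = sample_opponent(mix, random.Random(0))
```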

Heuristic-majority league training (recommended robust default):

python scripts/train_ppo.py \
  --policy-scope play \
  --opponent-mix-preset heuristic_majority \
  --opponent-checkpoint-dir experiments/opponent_league \
  --opponent-checkpoint-limit 12 \
  --snapshot-dir experiments/opponent_league \
  --snapshot-prefix play_league \
  --snapshot-every-updates 5 \
  --snapshot-max-keep 30
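
The snapshot flags suggest a periodic-save plus bounded-retention scheme. A sketch of that bookkeeping, assuming snapshots are taken on every multiple of --snapshot-every-updates and the oldest are pruned beyond --snapshot-max-keep; both helpers are hypothetical, and treating a non-positive max_keep as "keep everything" is an assumption:

```python
def should_snapshot(update, every):
    """True on updates that are positive multiples of `every`."""
    return every > 0 and update > 0 and update % every == 0

def prune_snapshots(snapshot_updates, max_keep):
    """Keep the newest `max_keep` snapshots; return (kept, removed)."""
    ordered = sorted(snapshot_updates)
    if max_keep <= 0 or len(ordered) <= max_keep:
        return ordered, []
    return ordered[-max_keep:], ordered[:-max_keep]

# 200 updates with a snapshot every 5 updates, keeping at most 30.
taken = [u for u in range(1, 201) if should_snapshot(u, every=5)]
kept, removed = prune_snapshots(taken, max_keep=30)
```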

Conservative phase-head evaluation gates:

python scripts/evaluate_ppo_phase_heads.py \
  --candidate-checkpoint checkpoints/ppo_best.pt \
  --baseline-checkpoint experiments/ppo_sweep_2026-02-26/cfg3_best.pt \
  --games-per-matchup 2000 \
  --opening-bid-games 500 \
  --min-heuristic-win-rate 0.20 \
  --output experiments/ppo_phase_heads/eval_report.json

Dedicated bidding pipeline (recommended next step for addressing bidding collapse):

# 1) Train bid-only supervised model (with optional temperature calibration)
python scripts/train_bid_supervised.py \
  --bid-data data/phase2/supervised_full/bid_decisions/*.parquet data/phase2/supervised_v2_delta/bid_decisions/*.parquet \
  --output experiments/bid_supervised_v1/bid_model.pt \
  --summary-out experiments/bid_supervised_v1/summary.json \
  --class-balance inverse_sqrt \
  --calibrate-temperature

# 2) Warm-start dedicated bid-only PPO from supervised bid model
python scripts/train_bid_policy.py \
  --model-variant phase_heads \
  --stable-policy ppo \
  --stable-ppo-checkpoint experiments/ppo_sweep_2026-02-26/cfg3_best.pt \
  --warm-start-bid-supervised experiments/bid_supervised_v1/bid_model.pt \
  --bid-anchor-supervised experiments/bid_supervised_v1/bid_model.pt \
  --bid-anchor-coef 0.03 \
  --bid-anchor-temperature 0.8 \
  --hand-strength-signal-coef 0.05 \
  --checkpoint experiments/bid_ppo_v1/latest.pt \
  --best-checkpoint experiments/bid_ppo_v1/best.pt \
  --metrics-out experiments/bid_ppo_v1/metrics.jsonl

# 3) Evaluate bid-tuned PPO checkpoint with conservative gate script
python scripts/evaluate_ppo_phase_heads.py \
  --candidate-checkpoint experiments/bid_ppo_v1/best.pt \
  --candidate-fallback-checkpoint experiments/ppo_sweep_2026-02-26/cfg3_best.pt \
  --baseline-checkpoint experiments/ppo_sweep_2026-02-26/cfg3_best.pt \
  --games-per-matchup 2000 \
  --opening-bid-games 500 \
  --output experiments/bid_ppo_v1/eval.json

# 4) Evaluate bid-policy quality/calibration on held-out bid states
python scripts/evaluate_bid_policy_quality.py \
  --candidate-checkpoint experiments/bid_ppo_v1/best.pt \
  --sample-rows 300000 \
  --output experiments/bid_ppo_v1/bid_quality.json
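
Step 1 passes --class-balance inverse_sqrt. One common reading of that name is a per-class loss weight proportional to 1/sqrt(class count), which softens imbalance without fully equalizing classes. The sketch below follows that reading; the normalization to a mean sample weight of 1 is an assumption, and inverse_sqrt_weights is not the repository's function:

```python
import math
from collections import Counter

def inverse_sqrt_weights(labels):
    """Weight each class by 1/sqrt(count), scaled so the mean sample weight is 1."""
    counts = Counter(labels)
    raw = {cls: 1.0 / math.sqrt(n) for cls, n in counts.items()}
    mean = sum(raw[y] for y in labels) / len(labels)
    return {cls: w / mean for cls, w in raw.items()}

# Toy bid labels: 16 passes and 4 bids of 1.
labels = ["pass"] * 16 + [1] * 4
weights = inverse_sqrt_weights(labels)
```

With a 4:1 count ratio the rarer class is weighted 2x rather than 4x, which is the intended compromise between no rebalancing and full inverse-frequency weighting.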

Stability-first bid pipeline (fixed baseline + fixed eval seeds + burst training + deterministic promotion):

python scripts/run_bid_stability_pipeline.py \
  --base-checkpoint experiments/bid_ppo_v6_handstrength_2026-02-27/hs_cfg1_best.pt \
  --baseline-checkpoint experiments/ppo_sweep_2026-02-26/cfg3_best.pt \
  --stable-ppo-checkpoint experiments/ppo_sweep_2026-02-26/cfg3_best.pt \
  --bursts 8 \
  --burst-updates 8 \
  --early-stop-patience-updates 4 \
  --early-stop-warmup-updates 4 \
  --early-stop-min-delta 0.002

The default promotion gate set now requires:

  • non_regression_vs_baseline
  • seat_stability
  • heuristic_strength
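
The gate set above can be read as an all-must-pass check. A minimal sketch under that assumption, where a missing gate result counts as a failure; promotion_decision and the result layout are illustrative and do not reflect the pipeline's actual summary.json schema:

```python
REQUIRED_GATES = (
    "non_regression_vs_baseline",
    "seat_stability",
    "heuristic_strength",
)

def promotion_decision(gate_results, required=REQUIRED_GATES):
    """Promote only if every required gate is present and passed."""
    failed = [g for g in required if not gate_results.get(g, False)]
    return {"promote": not failed, "failed_gates": failed}

decision = promotion_decision({
    "non_regression_vs_baseline": True,
    "seat_stability": False,
    "heuristic_strength": True,
})
```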

The pipeline writes:

  • summary.json with burst-by-burst metrics + promotion decisions.
  • REPORT.md with a compact comparison table.
  • promoted/promoted_latest.pt as the current promoted checkpoint.

Bundle-first scoped league pipeline:

python scripts/run_phase_league_pipeline.py \
  --python-exec /opt/anaconda3/envs/schnarps/bin/python \
  --experiment-dir experiments/phase_league_2026-03-10_r1 \
  --fallback-bundle experiments/long_explore_v1_2026-03-04/policy_bundle_long_v1.json \
  --baseline-bundle experiments/long_explore_v1_2026-03-04/policy_bundle_long_v1.json \
  --extra-opponent-bundles experiments/parallel_triple_v1_2026-03-05/policy_bundle_selected.json \
  --run-bid-rollout-refresh

The pipeline:

  • trains in_out and play against routed bundle opponents via strong_bundle_league
  • evaluates the resulting candidate bundle against heuristic plus strong bundle opponents
  • refreshes bid rollout labels only if the candidate bundle clears the required strength gates

Transfer-focused in_out curriculum (current recommended RL training path):

python scripts/run_in_out_transfer_curriculum.py \
  --python-exec /opt/anaconda3/envs/schnarps/bin/python \
  --experiment-dir experiments/in_out_transfer_curriculum_2026-03-11_r1

The pipeline:

  • runs heuristic-anchor bursts first to recover isolated in_out signal
  • switches to learned-partner bursts using routed bundle fallbacks and strong-bundle opponents
  • validates every burst with routed fixed-partner bundle comparison
  • promotes only when the candidate clears the transfer gates from docs/EVAL_PROTOCOL_V3.md

Current strategy reset rules:

  • bid: prefer rollout-supervised training, and select checkpoints from routed external eval
  • in_out: active RL scope via run_in_out_transfer_curriculum.py
  • play: current PPO loop is paused until its reward/curriculum is redesigned

Heuristic-backed scoped bootstrap with routed comparison reports:

python scripts/run_scoped_heuristic_bootstrap.py \
  --python-exec /opt/anaconda3/envs/schnarps/bin/python \
  --experiment-dir experiments/scoped_heuristic_bootstrap_2026-03-10_r1

The pipeline:

  • trains bid, in_out, and play with heuristic partners to reduce cross-module poisoning
  • builds routed incumbent and candidate bundles for each scope
  • writes scope-specific routed comparison reports against the routed incumbent plus strong extra bundles

When evaluating a scoped checkpoint directly (for example a bid-only PPO model), pass a full fallback via --candidate-fallback-checkpoint so only the owned phase comes from the candidate.
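
The fallback behavior described above amounts to routing each decision by phase. A sketch assuming the three phases named in this README (bid, in_out, play); route_policy is a hypothetical helper and the strings stand in for loaded policy objects:

```python
def route_policy(phase, owned_phases, candidate, fallback):
    """Use the candidate only for phases it owns; otherwise use the fallback."""
    return candidate if phase in owned_phases else fallback

phases = ["bid", "in_out", "play"]
# A bid-only candidate: every other phase is served by the fallback checkpoint.
routing = {p: route_policy(p, {"bid"}, "candidate", "fallback") for p in phases}
```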

Early-stop controls are also available directly in:

  • scripts/train_bid_policy.py
  • scripts/train_ppo.py
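
A sketch of how patience, warmup, and min-delta controls typically interact, assuming a higher-is-better eval metric, that non-improving updates are not counted during warmup, and that an improvement must exceed the best value by more than min_delta. The function early_stop_update is illustrative; the scripts' exact semantics may differ:

```python
def early_stop_update(history, patience, warmup, min_delta):
    """Return the 1-based update at which training would stop, or None."""
    best = float("-inf")
    stale = 0
    for update, metric in enumerate(history, start=1):
        if metric > best + min_delta:
            best = metric
            stale = 0  # a real improvement resets the counter
        elif update > warmup:
            stale += 1  # only count stale updates after warmup
            if stale >= patience:
                return update
    return None

# Made-up eval metrics that plateau after the second update.
history = [0.10, 0.12, 0.121, 0.1215, 0.1212, 0.1219, 0.1205, 0.121, 0.13]
stop_at = early_stop_update(history, patience=4, warmup=4, min_delta=0.002)
```

In this toy history the late jump to 0.13 is never reached, which illustrates the trade-off: a short patience saves compute but can cut off a delayed improvement.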

Train supervised advisor models (bid/move):

python scripts/train_supervised_v1.py \
  --bid-data data/phase2/supervised_full/bid_decisions/*.parquet data/phase2/supervised_v2_delta/bid_decisions/*.parquet \
  --move-data data/phase2/supervised_full/move_decisions/*.parquet data/phase2/supervised_v2_delta/move_decisions/*.parquet \
  --output-dir experiments/supervised_v3

Seat-balanced + seed-paired advisor evaluation:

python scripts/evaluate_supervised_seat_balanced.py \
  --bid-model experiments/supervised_v3/bid_model.pt \
  --move-model experiments/supervised_v3/move_model.pt \
  --games-per-matchup 5000 \
  --output experiments/supervised_v3/eval_seat_balanced_5k.json

Build Phase-4 vertical-slice payload:

python scripts/build_phase4_vertical_slice.py \
  --advisor-policy ppo \
  --policy-bundle docs/examples/policy_bundle_v1.json \
  --ppo-value-temperature 2.0 \
  --risk-profile balanced \
  --output apps/tauri-ui/mock/vertical_slice.json

Run local Phase-4 bridge + interactive UI:

python scripts/run_phase4_bridge.py \
  --advisor-policy ppo \
  --policy-bundle docs/examples/policy_bundle_v1.json \
  --opponent-policy heuristic \
  --risk-profile balanced \
  --static-dir apps/tauri-ui/mock \
  --port 8765

Latency/stability benchmark for advisor inference:

python scripts/benchmark_advisor_latency.py \
  --advisor-policy ppo \
  --policy-bundle docs/examples/policy_bundle_v1.json \
  --output experiments/phase4_alpha/latency_report_ppo.json

PPO-teacher distillation + calibration (Phase-3 refinement):

python scripts/train_distilled_from_ppo.py \
  --teacher-checkpoint experiments/bid_ppo_v6_handstrength_2026-02-27/hs_cfg1_best.pt \
  --output-dir experiments/supervised_distill_v1

python scripts/calibrate_supervised_temperature.py \
  --bid-checkpoint experiments/supervised_distill_v1/bid_model.pt \
  --move-checkpoint experiments/supervised_distill_v1/move_model.pt \
  --output-dir experiments/supervised_distill_v1/calibrated
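
Temperature calibration usually means fitting a single scalar T that divides the logits so that held-out negative log-likelihood is minimized. The sketch below does this with a grid search on toy data; it illustrates the general technique, not the calibration script's implementation, and all names and values are made up:

```python
import math

def nll(logits_batch, labels, temperature):
    """Mean negative log-likelihood of labels under softmax(logits / T)."""
    total = 0.0
    for logits, label in zip(logits_batch, labels):
        scaled = [z / temperature for z in logits]
        peak = max(scaled)
        log_norm = peak + math.log(sum(math.exp(z - peak) for z in scaled))
        total += log_norm - scaled[label]
    return total / len(labels)

def fit_temperature(logits_batch, labels, grid):
    """Pick the grid temperature with the lowest held-out NLL."""
    return min(grid, key=lambda t: nll(logits_batch, labels, t))

# Overconfident toy model: always predicts class 0 strongly, right 3 times out of 4.
logits_batch = [[4.0, 0.0]] * 4
labels = [0, 0, 0, 1]
grid = [0.5 + 0.25 * i for i in range(19)]  # 0.5 to 5.0
best_t = fit_temperature(logits_batch, labels, grid)
```

A fitted temperature above 1 flattens the distribution, which is the expected correction for an overconfident model; the argmax prediction is unchanged.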

Compute formal Phase-4 go/no-go decision:

python scripts/phase4_go_no_go.py \
  --canonical-summary data/phase2/supervised_v2_merged_summary.json \
  --train-summary experiments/supervised_v3/summary.json \
  --eval-summary experiments/supervised_v3/eval_seat_balanced_5k.json \
  --calibration-summary experiments/supervised_distill_v1/calibrated/summary.json \
  --output experiments/supervised_v3/phase4_go_no_go.json

Project Layout

  • schnarps/engine: deterministic rules engine
  • schnarps/bots: baseline policies
  • schnarps/rl: self-play / evaluation scaffolding
  • schnarps/data: logging, replay, export
  • schnarps/advisor: recommendation API contract
  • apps/tauri-ui: UI phase placeholder
  • tests: rule and determinism tests

New architecture/reference docs:

  • docs/ARCHITECTURE_RL.md
  • docs/POLICY_BUNDLE_SPEC.md
  • docs/TRAINING_PHASED_3POLICY.md
  • docs/PROBABILITY_MODEL.md
  • docs/EVAL_PROTOCOL_V3.md
  • docs/EVAL_PROTOCOL_V2.md

Current Assumptions

The engine follows Rules.md plus the explicit planning decisions finalized in chat:

  • 52-card deck, 5 cards per active player
  • Bids are pass or integer 1..5
  • All-pass forces dealer to bid 1
  • High bidder always IN
  • Forced-IN when trump is spades or high bid is 1
  • Punt overrides trick-based reduction
  • Sitting at score <=5 adds +1
  • Immediate win at score <=0; elimination at >=32
  • Dealer rotates +1 seat, skipping eliminated players
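
A few of these assumptions can be expressed as small pure functions. This is a sketch only, assuming five seats numbered 0..4 (matching the learner seats used in the training commands), bids stored as "pass" or an integer, and that the forced-IN rule applies to the whole table. None of these helpers are the engine's actual API:

```python
def next_dealer(dealer, eliminated, num_seats=5):
    """Rotate the dealer +1 seat, skipping eliminated players."""
    for step in range(1, num_seats + 1):
        seat = (dealer + step) % num_seats
        if seat not in eliminated:
            return seat
    raise ValueError("no active players remain")

def apply_all_pass_rule(bids, dealer):
    """If every bid is a pass, the dealer is forced to bid 1."""
    if all(b == "pass" for b in bids.values()):
        return {**bids, dealer: 1}
    return dict(bids)

def is_forced_in(trump, high_bid):
    """Forced-IN when trump is spades or the high bid is 1."""
    return trump == "spades" or high_bid == 1

def player_status(score):
    """Immediate win at score <= 0; elimination at score >= 32."""
    if score <= 0:
        return "won"
    if score >= 32:
        return "eliminated"
    return "active"
```

For example, next_dealer(3, eliminated={4}) skips seat 4 and wraps around to seat 0.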

About

Playable Schnarps game with several AI agents trained via reinforcement learning, plus probability calculators for every move.
