Python-first implementation of the Schnarps project plan:
- Rules-complete game engine (new_game, legal_actions, step)
- CLI play modes (play-human-vs-bot, bot-vs-bot, replay)
- Baseline bot policies (random + heuristic)
- RL environment wrapper + masked PPO trainer scaffold
- JSONL logging + Parquet export pipeline
- Advisor API scaffold for UI integration
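For readers unfamiliar with the "masked" part of masked PPO: the usual idea is that the policy only assigns probability to actions the engine reports as legal. The sketch below illustrates that idea in plain Python. It is an assumption about the general technique, not the trainer's actual code; `masked_softmax` and `sample_action` are hypothetical names, and the boolean mask stands in for whatever `legal_actions` returns.

```python
import math
import random
from typing import List, Sequence


def masked_softmax(logits: Sequence[float], legal: Sequence[bool]) -> List[float]:
    """Softmax over legal actions only; illegal actions get probability 0."""
    if not any(legal):
        raise ValueError("at least one action must be legal")
    # Subtract the largest legal logit for numerical stability.
    top = max(x for x, ok in zip(logits, legal) if ok)
    weights = [math.exp(x - top) if ok else 0.0 for x, ok in zip(logits, legal)]
    total = sum(weights)
    return [w / total for w in weights]


def sample_action(logits: Sequence[float], legal: Sequence[bool], rng: random.Random) -> int:
    """Sample an action index from the masked distribution."""
    probs = masked_softmax(logits, legal)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

With logits `[1.0, 2.0, 3.0]` and mask `[True, False, True]`, action 1 can never be sampled, and the remaining probability mass is split between actions 0 and 2.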
Set up a virtual environment, install, and run the tests:
python -m venv .venv
source .venv/bin/activate
pip install -e '.[test,data,rl]'
pytest
Conda alternative:
conda activate schnarps
pip install -e '.[test,data,rl]'
pytest
Run CLI:
schnarps play-human-vs-bot --seed 1 --human-player 0
schnarps bot-vs-bot --games 10 --seed 10 --log data/games.jsonl
schnarps replay data/games.jsonl
schnarps-export data/games.jsonl data/games.parquet
Run simulator script:
python scripts/simulate.py --games 1000 --output data/games.jsonl
Phase-2 pilot generation from trained checkpoint:
python scripts/generate_phase2_pilot.py \
--checkpoint experiments/ppo_sweep_2026-02-26/cfg3_best.pt \
--games 5000 \
--output data/phase2/pilot_games.jsonl
Build canonical Phase-2 events parquet:
python scripts/build_phase2_canonical.py \
--inputs data/phase2/pilot_games_concurrent.parquet data/phase2/pilot_games_batch2.parquet \
--output data/phase2/canonical_events.parquet
Extract first-pass supervised bid/move datasets:
python scripts/extract_phase2_supervised.py \
--inputs data/phase2/pilot_games_concurrent.jsonl data/phase2/pilot_games_batch2.jsonl \
--output-dir data/phase2/supervised \
--max-games 500
Full-scale sharded extraction:
python scripts/extract_phase2_supervised.py \
--inputs data/phase2/pilot_games_concurrent.jsonl data/phase2/pilot_games_batch2.jsonl \
--output-dir data/phase2/supervised_full \
--max-games 0 \
--shard-rows 250000 \
--progress-every 500
Train PPO baseline:
python scripts/train_ppo.py --updates 20 --steps-per-update 2048 --checkpoint checkpoints/ppo_latest.pt
Phase-scoped PPO training for separate checkpoints:
python scripts/train_ppo.py --policy-scope in_out --checkpoint checkpoints/ppo_in_out_latest.pt
python scripts/train_ppo.py --policy-scope play --checkpoint checkpoints/ppo_play_latest.pt
By default, training also tracks and saves the best eval checkpoint at
checkpoints/ppo_best.pt. Disable with --no-track-best.
If you change the evaluation regime substantially (for example, switching from
full-policy eval semantics to scoped phase-only eval), write to a fresh
--best-checkpoint path so that a stale best-score threshold does not carry over.
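To see why a stale threshold matters, consider a tracker that persists its best eval score and only overwrites when a new score beats it. The sketch below is a hypothetical illustration: the `BestTracker` class and its JSON sidecar file are assumptions, not the actual logic in scripts/train_ppo.py, and the scores are made-up numbers.

```python
import json
import tempfile
from pathlib import Path
from typing import Tuple


class BestTracker:
    """Hypothetical tracker: keeps the best eval score in a sidecar file."""

    def __init__(self, best_path: Path) -> None:
        self.meta_path = best_path.parent / (best_path.name + ".meta.json")
        # Resume the threshold from disk if an earlier run left one behind.
        if self.meta_path.exists():
            self.best_score = json.loads(self.meta_path.read_text())["best_score"]
        else:
            self.best_score = float("-inf")

    def update(self, score: float) -> bool:
        """Return True (and persist) only when `score` beats the stored best."""
        if score <= self.best_score:
            return False
        self.best_score = score
        self.meta_path.write_text(json.dumps({"best_score": score}))
        return True


def demo() -> Tuple[bool, bool]:
    with tempfile.TemporaryDirectory() as tmp:
        old = BestTracker(Path(tmp) / "ppo_best.pt")
        old.update(0.60)  # illustrative score under the old eval regime
        # Under a new regime, scores may sit on a different scale.
        reused = BestTracker(Path(tmp) / "ppo_best.pt").update(0.35)
        fresh = BestTracker(Path(tmp) / "ppo_best_scoped.pt").update(0.35)
        return reused, fresh
```

Here `demo()` returns `(False, True)`: the reused path rejects the new score because of the old threshold, while the fresh path accepts it.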
Seat randomization (recommended):
python scripts/train_ppo.py --seat-mode random --learner-seats 0,1,2,3,4
Deterministic seat balancing (useful for stability diagnostics):
python scripts/train_ppo.py --seat-mode cycle --learner-seats 0,1,2,3,4
Phase-head PPO training (new track):
python scripts/train_ppo.py \
--model-variant phase_heads \
--seat-mode random \
--learner-seats 0,1,2,3,4 \
--opponent-mix heuristic:0.6,random:0.2,self:0.2 \
--entropy-coef 0.01 \
--entropy-coef-bid 0.02 \
--eval-stochastic-episodes 20 \
--entropy-coef-play 0.005 \
--hand-strength-signal-coef 0.05
Heuristic-majority league training (recommended robust default):
python scripts/train_ppo.py \
--policy-scope play \
--opponent-mix-preset heuristic_majority \
--opponent-checkpoint-dir experiments/opponent_league \
--opponent-checkpoint-limit 12 \
--snapshot-dir experiments/opponent_league \
--snapshot-prefix play_league \
--snapshot-every-updates 5 \
--snapshot-max-keep 30
Conservative phase-head evaluation gates:
python scripts/evaluate_ppo_phase_heads.py \
--candidate-checkpoint checkpoints/ppo_best.pt \
--baseline-checkpoint experiments/ppo_sweep_2026-02-26/cfg3_best.pt \
--games-per-matchup 2000 \
--opening-bid-games 500 \
--min-heuristic-win-rate 0.20 \
--output experiments/ppo_phase_heads/eval_report.json
Dedicated bidding pipeline (recommended next for bidding collapse):
# 1) Train bid-only supervised model (with optional temperature calibration)
python scripts/train_bid_supervised.py \
--bid-data data/phase2/supervised_full/bid_decisions/*.parquet data/phase2/supervised_v2_delta/bid_decisions/*.parquet \
--output experiments/bid_supervised_v1/bid_model.pt \
--summary-out experiments/bid_supervised_v1/summary.json \
--class-balance inverse_sqrt \
--calibrate-temperature
# 2) Warm-start dedicated bid-only PPO from supervised bid model
python scripts/train_bid_policy.py \
--model-variant phase_heads \
--stable-policy ppo \
--stable-ppo-checkpoint experiments/ppo_sweep_2026-02-26/cfg3_best.pt \
--warm-start-bid-supervised experiments/bid_supervised_v1/bid_model.pt \
--bid-anchor-supervised experiments/bid_supervised_v1/bid_model.pt \
--bid-anchor-coef 0.03 \
--bid-anchor-temperature 0.8 \
--hand-strength-signal-coef 0.05 \
--checkpoint experiments/bid_ppo_v1/latest.pt \
--best-checkpoint experiments/bid_ppo_v1/best.pt \
--metrics-out experiments/bid_ppo_v1/metrics.jsonl
# 3) Evaluate bid-tuned PPO checkpoint with conservative gate script
python scripts/evaluate_ppo_phase_heads.py \
--candidate-checkpoint experiments/bid_ppo_v1/best.pt \
--candidate-fallback-checkpoint experiments/ppo_sweep_2026-02-26/cfg3_best.pt \
--baseline-checkpoint experiments/ppo_sweep_2026-02-26/cfg3_best.pt \
--games-per-matchup 2000 \
--opening-bid-games 500 \
--output experiments/bid_ppo_v1/eval.json
# 4) Evaluate bid-policy quality/calibration on held-out bid states
python scripts/evaluate_bid_policy_quality.py \
--candidate-checkpoint experiments/bid_ppo_v1/best.pt \
--sample-rows 300000 \
--output experiments/bid_ppo_v1/bid_quality.json
Stability-first bid pipeline (fixed baseline + fixed eval seeds + burst training + deterministic promotion):
python scripts/run_bid_stability_pipeline.py \
--base-checkpoint experiments/bid_ppo_v6_handstrength_2026-02-27/hs_cfg1_best.pt \
--baseline-checkpoint experiments/ppo_sweep_2026-02-26/cfg3_best.pt \
--stable-ppo-checkpoint experiments/ppo_sweep_2026-02-26/cfg3_best.pt \
--bursts 8 \
--burst-updates 8 \
--early-stop-patience-updates 4 \
--early-stop-warmup-updates 4 \
--early-stop-min-delta 0.002
The default promotion gate set now requires:
- non_regression_vs_baseline
- seat_stability
- heuristic_strength
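A minimal sketch of an all-gates-must-pass promotion check, assuming each gate reduces to a boolean. The gate names come from the list above; `should_promote` and the rule that a missing gate counts as a failure are assumptions for illustration, not the pipeline's actual implementation.

```python
from typing import Dict, List, Tuple

REQUIRED_GATES = ("non_regression_vs_baseline", "seat_stability", "heuristic_strength")


def should_promote(gate_results: Dict[str, bool]) -> Tuple[bool, List[str]]:
    """Promote only if every required gate passed.

    Returns (decision, failed_gates). A gate absent from `gate_results`
    is treated as failed (an assumption made for safety in this sketch).
    """
    failed = [gate for gate in REQUIRED_GATES if not gate_results.get(gate, False)]
    return (not failed, failed)
```

For example, a candidate that passes `non_regression_vs_baseline` and `seat_stability` but fails `heuristic_strength` yields `(False, ["heuristic_strength"])`.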
The pipeline writes:
- summary.json with burst-by-burst metrics + promotion decisions.
- REPORT.md with a compact comparison table.
- promoted/promoted_latest.pt as the current promoted checkpoint.
Bundle-first scoped league pipeline:
python scripts/run_phase_league_pipeline.py \
--python-exec /opt/anaconda3/envs/schnarps/bin/python \
--experiment-dir experiments/phase_league_2026-03-10_r1 \
--fallback-bundle experiments/long_explore_v1_2026-03-04/policy_bundle_long_v1.json \
--baseline-bundle experiments/long_explore_v1_2026-03-04/policy_bundle_long_v1.json \
--extra-opponent-bundles experiments/parallel_triple_v1_2026-03-05/policy_bundle_selected.json \
--run-bid-rollout-refresh
The pipeline:
- trains in_out and play against routed bundle opponents via strong_bundle_league
- evaluates the resulting candidate bundle against heuristic plus strong bundle opponents
- refreshes bid rollout labels only if the candidate bundle clears the required strength gates
Transfer-focused in_out curriculum (current recommended RL training path):
python scripts/run_in_out_transfer_curriculum.py \
--python-exec /opt/anaconda3/envs/schnarps/bin/python \
--experiment-dir experiments/in_out_transfer_curriculum_2026-03-11_r1
The pipeline:
- runs heuristic-anchor bursts first to recover isolated in_out signal
- switches to learned-partner bursts using routed bundle fallbacks and strong-bundle opponents
- validates every burst with routed fixed-partner bundle comparison
- promotes only when the candidate clears the transfer gates from docs/EVAL_PROTOCOL_V3.md
Current strategy reset rules:
- bid: prefer rollout-supervised training, and select checkpoints from routed external eval
- in_out: active RL scope via run_in_out_transfer_curriculum.py
- play: current PPO loop is paused until its reward/curriculum is redesigned
Heuristic-backed scoped bootstrap with routed comparison reports:
python scripts/run_scoped_heuristic_bootstrap.py \
--python-exec /opt/anaconda3/envs/schnarps/bin/python \
--experiment-dir experiments/scoped_heuristic_bootstrap_2026-03-10_r1
The pipeline:
- trains bid, in_out, and play with heuristic partners to reduce cross-module poisoning
- builds routed incumbent and candidate bundles for each scope
- writes scope-specific routed comparison reports against the routed incumbent plus strong extra bundles
When evaluating a scoped checkpoint directly (for example a bid-only PPO model),
pass a full fallback via --candidate-fallback-checkpoint so only the owned
phase comes from the candidate.
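The fallback idea can be sketched as a per-phase routing table: the candidate handles only the phase it owns, and the fallback handles everything else. The phase names (bid, in_out, play) are taken from this README; the `Policy` type and `route_policies` function are hypothetical and do not reflect the evaluation script's internals.

```python
from typing import Callable, Dict

# Hypothetical policy type: maps an observation dict to an action label.
Policy = Callable[[dict], str]

PHASES = ("bid", "in_out", "play")


def route_policies(candidate: Policy, owned_phase: str, fallback: Policy) -> Dict[str, Policy]:
    """Use `candidate` for its owned phase and `fallback` for every other phase."""
    if owned_phase not in PHASES:
        raise ValueError(f"unknown phase: {owned_phase!r}")
    return {phase: (candidate if phase == owned_phase else fallback) for phase in PHASES}
```

With a bid-only candidate, `route_policies(candidate, "bid", fallback)` sends bid decisions to the candidate and in_out/play decisions to the fallback, so differences in the eval report can be attributed to the bid phase alone.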
Early-stop controls are also available directly in:
- scripts/train_bid_policy.py
- scripts/train_ppo.py
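As a rough picture of how patience, warmup, and min-delta usually interact, here is a minimal early-stopping sketch. The field names mirror the --early-stop-patience-updates, --early-stop-warmup-updates, and --early-stop-min-delta flags used earlier, but the exact semantics in the training scripts are not documented here, so treat the behavior below (higher metric is better, no staleness counted during warmup) as an assumption.

```python
from dataclasses import dataclass


@dataclass
class EarlyStopper:
    """Hypothetical early stopper; assumes a higher eval metric is better."""

    patience: int = 4        # stale updates tolerated before stopping
    warmup: int = 4          # initial updates during which staleness is not counted
    min_delta: float = 0.002  # minimum gain that counts as an improvement
    best: float = float("-inf")
    updates_seen: int = 0
    stale: int = 0

    def step(self, metric: float) -> bool:
        """Record one eval metric; return True when training should stop."""
        self.updates_seen += 1
        if metric > self.best + self.min_delta:
            self.best = metric
            self.stale = 0
        elif self.updates_seen > self.warmup:
            self.stale += 1
        return self.stale >= self.patience
```

Calling `step()` once per update with the latest eval metric stops training after `patience` consecutive post-warmup updates without a gain larger than `min_delta`.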
Train supervised advisor models (bid/move):
python scripts/train_supervised_v1.py \
--bid-data data/phase2/supervised_full/bid_decisions/*.parquet data/phase2/supervised_v2_delta/bid_decisions/*.parquet \
--move-data data/phase2/supervised_full/move_decisions/*.parquet data/phase2/supervised_v2_delta/move_decisions/*.parquet \
--output-dir experiments/supervised_v3
Seat-balanced + seed-paired advisor evaluation:
python scripts/evaluate_supervised_seat_balanced.py \
--bid-model experiments/supervised_v3/bid_model.pt \
--move-model experiments/supervised_v3/move_model.pt \
--games-per-matchup 5000 \
--output experiments/supervised_v3/eval_seat_balanced_5k.json
Build Phase-4 vertical-slice payload:
python scripts/build_phase4_vertical_slice.py \
--advisor-policy ppo \
--policy-bundle docs/examples/policy_bundle_v1.json \
--ppo-value-temperature 2.0 \
--risk-profile balanced \
--output apps/tauri-ui/mock/vertical_slice.json
Run local Phase-4 bridge + interactive UI:
python scripts/run_phase4_bridge.py \
--advisor-policy ppo \
--policy-bundle docs/examples/policy_bundle_v1.json \
--opponent-policy heuristic \
--risk-profile balanced \
--static-dir apps/tauri-ui/mock \
--port 8765
Latency/stability benchmark for advisor inference:
python scripts/benchmark_advisor_latency.py \
--advisor-policy ppo \
--policy-bundle docs/examples/policy_bundle_v1.json \
--output experiments/phase4_alpha/latency_report_ppo.json
PPO-teacher distillation + calibration (Phase-3 refinement):
python scripts/train_distilled_from_ppo.py \
--teacher-checkpoint experiments/bid_ppo_v6_handstrength_2026-02-27/hs_cfg1_best.pt \
--output-dir experiments/supervised_distill_v1
python scripts/calibrate_supervised_temperature.py \
--bid-checkpoint experiments/supervised_distill_v1/bid_model.pt \
--move-checkpoint experiments/supervised_distill_v1/move_model.pt \
--output-dir experiments/supervised_distill_v1/calibrated
Compute formal Phase-4 go/no-go decision:
python scripts/phase4_go_no_go.py \
--canonical-summary data/phase2/supervised_v2_merged_summary.json \
--train-summary experiments/supervised_v3/summary.json \
--eval-summary experiments/supervised_v3/eval_seat_balanced_5k.json \
--calibration-summary experiments/supervised_distill_v1/calibrated/summary.json \
--output experiments/supervised_v3/phase4_go_no_go.json
Project layout:
- schnarps/engine: deterministic rules engine
- schnarps/bots: baseline policies
- schnarps/rl: self-play / evaluation scaffolding
- schnarps/data: logging, replay, export
- schnarps/advisor: recommendation API contract
- apps/tauri-ui: UI phase placeholder
- tests: rule and determinism tests
New architecture/reference docs:
- docs/ARCHITECTURE_RL.md
- docs/POLICY_BUNDLE_SPEC.md
- docs/TRAINING_PHASED_3POLICY.md
- docs/PROBABILITY_MODEL.md
- docs/EVAL_PROTOCOL_V3.md
- docs/EVAL_PROTOCOL_V2.md
The engine follows Rules.md plus the explicit planning decisions finalized in chat:
- 52-card deck, 5 cards per active player
- Bids are pass or integer 1..5
- All-pass forces dealer to bid 1
- High bidder always IN
- Forced-IN when trump is spades or high bid is 1
- Punt overrides trick-based reduction
- Sitting at score <=5 adds +1
- Immediate win at score <=0; elimination at >=32
- Dealer rotates +1 seat, skipping eliminated players
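A few of the rules above are simple enough to sketch as pure functions. This is an illustrative reading of the bullet list, not the code in schnarps/engine: every function name, the "spades" suit label, and the five-seat default are assumptions, and the punt rule is omitted because its details are not given here.

```python
from typing import Dict, Set, Union

Bid = Union[str, int]  # "pass" or an integer 1..5


def apply_all_pass_rule(bids: Dict[int, Bid], dealer: int) -> Dict[int, Bid]:
    """If every seat passed, the dealer is forced to bid 1."""
    resolved = dict(bids)
    if all(bid == "pass" for bid in bids.values()):
        resolved[dealer] = 1
    return resolved


def is_forced_in(trump: str, high_bid: int) -> bool:
    """Forced-IN applies when trump is spades or the high bid is 1."""
    return trump == "spades" or high_bid == 1


def sitting_penalty(score: int) -> int:
    """Sitting out at score <= 5 adds +1 to the score."""
    return 1 if score <= 5 else 0


def score_status(score: int) -> str:
    """Immediate win at score <= 0; elimination at score >= 32."""
    if score <= 0:
        return "win"
    if score >= 32:
        return "eliminated"
    return "active"


def next_dealer(dealer: int, eliminated: Set[int], num_seats: int = 5) -> int:
    """Rotate the dealer +1 seat, skipping eliminated players."""
    for offset in range(1, num_seats + 1):
        seat = (dealer + offset) % num_seats
        if seat not in eliminated:
            return seat
    raise ValueError("no active players remain")
```

For instance, with seats 4 and 0 eliminated, `next_dealer(3, {4, 0})` skips both and returns seat 1.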