gardoslab · Farid-Karimli · Jun 8, 2026 · Jun 4, 2026
diff --git a/finetuning/SWIN/CONCRETE_RUN_README.md b/finetuning/SWIN/CONCRETE_RUN_README.md
@@ -0,0 +1,260 @@
+# Concrete next run — SWIN-L 384 (Tier 1 & 2)
+
+Implements the "Putting it together — a concrete next run" recipe from
+`SWIN_training_setup_summary.md` (§10 recommendations, §11 step-by-step), as one strong
+single-model candidate plus the code features it needs.
+
+## What this run is
+
+| Lever | Choice | Section |
+|-------|--------|---------|
+| Backbone | `microsoft/swin-large-patch4-window12-384-in22k` (stay in SWIN) | 1.1 / 1.2 |
+| Resolution | 384 (processor-driven, no resize code) | 1.2 |
+| Loss | balanced-softmax CE + multi-task family/genus/species heads | 1.3-A |
+| Schedule | 100 epochs, 5% warmup, cosine, **EMA on** | 2.6 |
+| Augmentation | MEDIUM (RandAug mag 7, mild mixup/erasing) — cold-start safe | 2.5 |
+| Inference | multi-crop + flip TTA wired, enabled only for final prediction | 2.7 |
+
+Config: [`configs_advanced/swin_large_384_concrete.yml`](configs_advanced/swin_large_384_concrete.yml)
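+
+For intuition, the "5% warmup, cosine" schedule can be sketched as a multiplier on the base learning rate. This is a minimal standalone sketch, not the trainer's code: `lr_multiplier` is a hypothetical helper, and it assumes linear warmup from 0 followed by a half-cosine decay to 0.
+
+```python
+import math
+
+
+def lr_multiplier(step: int, total_steps: int, warmup_ratio: float = 0.05) -> float:
+    """Scale factor applied to the base LR at a given optimizer step (illustrative)."""
+    warmup_steps = int(total_steps * warmup_ratio)
+    if step < warmup_steps:
+        # Linear ramp from 0 up to the base LR during warmup.
+        return step / max(1, warmup_steps)
+    # Half-cosine decay from the base LR down to 0 over the remaining steps.
+    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
+    return 0.5 * (1.0 + math.cos(math.pi * progress))
+```
+
+With `total_steps=1000`, the multiplier ramps up over the first 50 steps and then decays for the remaining 950.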
+
+## Code features added to `SWIN_finetuning_advanced.py`
+
+All are config-gated and default **off**, so existing configs behave exactly as before.
+
+1. **Balanced softmax / logit adjustment (Tier 1.3-A)** — new `long_tail` section.
+   A per-class `log_prior` (log training frequency, in the species/CE head's index
+   space) is added to the species logits **during training only** (`logits + tau*log_prior`),
+   then plain argmax at inference. Down-weights head classes to lift macro-F1 on the long
+   tail. Applied in `MixupTrainer.compute_loss` for single-task and multi-task, in both the
+   mixup and non-mixup paths. Not applied to ArcFace.
+   ```yaml
+   long_tail:
+     logit_adjustment: true
+     tau: 1.0          # strength; 1.0 = standard balanced softmax
+   ```
+
+2. **Weight EMA (Tier 2.6)** — new `ema` section + `EMACallback`.
+   Maintains a shadow average of the parameters (`shadow = decay*shadow + (1-decay)*param`
+   every step) and copies it into the model at train end, so the final `evaluate()` and
+   `save_model()` reflect EMA weights. **Keep `load_best_model_at_end: false`** — the
+   best-checkpoint reload would otherwise be overwritten by the EMA copy.
+   ```yaml
+   ema:
+     enabled: true
+     decay: 0.9998
+   ```
+
+3. **Horizontal-flip TTA (Tier 2.7)** — `multi_crop.flip`.
+   `build_multi_crop_transforms(..., flip=True)` also emits a flipped variant of each crop,
+   so logits average over crops × {orig, flip}. Leave `multi_crop.enabled: false` during
+   training; enable it for the final/leaderboard prediction only.
+
+4. **Gradient-checkpointing passthrough** — `MultiTaskSwinModel` / `SwinWithArcFace` now
+   forward `gradient_checkpointing_enable/disable` to the backbone, so
+   `training.gradient_checkpointing: true` works for the wrapped models (needed to fit
+   SWIN-L @384 on one GPU).
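+
+The arithmetic behind features 1 and 2 can be sketched in plain Python. This is a toy illustration under stated assumptions, not the code in `MixupTrainer.compute_loss` or `EMACallback`: the helper names below are hypothetical, logits are plain lists, and every class is assumed to have at least one training sample.
+
+```python
+import math
+
+
+def log_prior_from_counts(counts):
+    """Per-class log training frequency (assumes every count is > 0)."""
+    total = sum(counts)
+    return [math.log(c / total) for c in counts]
+
+
+def adjusted_cross_entropy(logits, target, log_prior, tau=1.0):
+    """Cross-entropy on logits + tau * log_prior (training-time only)."""
+    adjusted = [z + tau * p for z, p in zip(logits, log_prior)]
+    m = max(adjusted)  # subtract the max for a numerically stable log-sum-exp
+    log_z = m + math.log(sum(math.exp(a - m) for a in adjusted))
+    return log_z - adjusted[target]
+
+
+def ema_update(shadow, params, decay=0.9998):
+    """One EMA step: shadow = decay * shadow + (1 - decay) * param."""
+    for name, value in params.items():
+        shadow[name] = decay * shadow[name] + (1.0 - decay) * value
+    return shadow
+```
+
+With class counts `[90, 10]` and equal raw logits, the adjusted loss is larger for the rare class than for the frequent one, which is how the prior shifts training pressure toward the tail.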
+
+## Environment setup (one-time)
+
+Jobs run via `train_advanced.sh`, which loads:
+
+```bash
+module load miniconda
+module load academic-ml/spring-2026
+conda activate spring-2026-pyt
+```
+
+`spring-2026-pyt` already provides torch 2.9.1, transformers 4.57.3 (≥4.52, required),
+datasets, accelerate, safetensors, torchvision, scikit-learn, pillow, pyyaml, numpy. Two
+packages it does **not** include are needed by the trainer — install them once into your
+user-site:
+
+```bash
+module load miniconda && conda activate spring-2026-pyt
+pip install --user evaluate wandb
+```
+
+Notes:
+- `evaluate` is required (accuracy / macro-F1); `wandb` is needed because the configs use
+  `report_to: wandb`. Pass `--set training.report_to=none` (or set `wandb.enabled: false`) to skip W&B.
+- If `import wandb` fails with `cannot import name 'validate_core_schema' from 'pydantic_core'`,
+  the `--user` install shadowed the env's `pydantic_core`. Remove the duplicate so the env's
+  copy is used again:
+  `rm -rf ~/.local/lib/python3.12/site-packages/pydantic_core ~/.local/lib/python3.12/site-packages/pydantic_core-*.dist-info`
+- `evaluate.load(...)` downloads its metric script from the HF hub on first use and caches it
+  under `~/.cache/huggingface`. Run the smoke test (below) once from a login node to warm the
+  cache if your compute nodes can't reach the hub.
+- The PyTorch env requires `gpu_c >= 7.0`; the submit scripts request `gpu_c=8.0` (A100), so OK.
+
+Sanity check the env:
+```bash
+python -c "import torch, transformers, datasets, evaluate, wandb; print('env OK')"
+```
+
+### Weights & Biases (logs to gardoslab / herbdl)
+
+The trainer calls `wandb.init(entity="gardoslab", project="herbdl", name=run_name,
+group=run_group, id=run_id, ...)` straight from the config (see
+`SWIN_finetuning_advanced.py`), so no code change is needed — you only need team membership
++ a valid API key.
+
+1. **Be a member of the `gardoslab` team.** Open <https://wandb.ai/gardoslab> while signed in.
+   If you can't see it, ask the team owner to invite your W&B username. `entity="gardoslab"`
+   fails with a permission error until you're a member — being logged in is not enough.
+
+2. **Authenticate on SCC** (login node; `~/.netrc` is shared, so compute-node jobs reuse it —
+   no per-job login). Grab your key from <https://wandb.ai/authorize>:
+   ```bash
+   module load miniconda && conda activate spring-2026-pyt
+   wandb login --relogin        # paste key; --relogin replaces a stale key
+   ```
+
+3. **Verify** (the stored key can be stale even though `~/.netrc` exists):
+   ```bash
+   wandb login --verify
+   python -c "import wandb; v=wandb.Api().viewer; print(v.username, '| teams:', v.teams)"
+   ```
+   `gardoslab` should appear in `teams`.
+
+Notes:
+- Alternative to `~/.netrc`: `export WANDB_API_KEY=<key>` in your shell profile (keeps the
+  key out of any committed script).
+- The seed loop in `submit_concrete.sh` sets a distinct `run_id`/`run_name` per seed, so seeds
+  appear as separate runs grouped under `SWIN_L_384_Concrete`.
+- To skip W&B for a run: `--set training.report_to=none` (or `wandb.enabled: false`).
+- If a compute node can't reach W&B: `export WANDB_MODE=offline`, then `wandb sync <run_dir>` later.
+
+## How to launch (you run this — nothing is auto-submitted)
+
+Single run (seed 0):
+```bash
+cd finetuning/SWIN
+SEEDS="0" bash submit_concrete.sh
+```
+
+3- or 5-seed ensemble:
+```bash
+bash submit_concrete.sh                 # seeds 0 1 2
+SEEDS="0 1 2 3 4" bash submit_concrete.sh
+```
+
+Each job requests 1 A100-80G GPU on `herbdl` for 48h and writes to
+`finetuning/output/SWIN/SWIN_L_384_CONCRETE_SEED<seed>/`. Adjust the `-M` email in
+`submit_concrete.sh` if needed.
+
+### Smoke test first (recommended)
+Verify the pipeline end-to-end cheaply before committing 48h jobs:
+```bash
+qsub -l h_rt=2:00:00 -pe omp 8 -P herbdl -l gpus=1 -l gpu_c=8.0 -l gpu_memory=80G \
+     -N SWINL384_SMOKE \
+     -v CONFIG_FILE=configs_advanced/swin_large_384_concrete.yml \
+     -v SET_ARGS="--set data.max_train_samples=2000 --set data.max_eval_samples=2000 --set training.num_train_epochs=1 --set training.output_dir=/projectnb/herbdl/workspaces/tgardos/herbdl/finetuning/output/SWIN/SMOKE --set training.overwrite_output_dir=true --set wandb.enabled=false" \
+     train_advanced.sh
+```
+
+## Output paths auto-relocate to your workspace
+
+Most configs in this repo (inherited from faridkar's checkout) hardcode `output_dir`/`logging_dir`
+under `/projectnb/herbdl/workspaces/faridkar/herbdl/...`. The trainer rewrites any
+`.../workspaces/<author>/herbdl` prefix to the repo you actually run from, preserving the
+trailing run name — so a `tgardos` checkout writes to
+`/projectnb/herbdl/workspaces/tgardos/herbdl/finetuning/output/SWIN/<NAME>` automatically,
+with no YAML edits. It logs the rewrite (`__CUSTOM__: Relocated output path ...`). Set
+`HERBDL_NO_RELOCATE=1` to disable (e.g. to write somewhere else via an explicit path).
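+
+The prefix rewrite described above can be pictured roughly as follows. This is a hedged sketch, not the trainer's implementation: `relocate_output_path` and its `repo_root` argument are hypothetical, and the regex is only an assumption about what counts as a `.../workspaces/<author>/herbdl` prefix.
+
+```python
+import os
+import re
+
+# Matches everything up to and including ".../workspaces/<author>/herbdl".
+_WORKSPACE_PREFIX = re.compile(r"^.*?/workspaces/[^/]+/herbdl(?=/|$)")
+
+
+def relocate_output_path(path: str, repo_root: str) -> str:
+    """Swap the workspace prefix for repo_root, keeping the trailing run path."""
+    if os.environ.get("HERBDL_NO_RELOCATE") == "1":
+        return path  # opt-out: leave explicit paths untouched
+    # A lambda avoids backslash-escape surprises in the replacement string.
+    return _WORKSPACE_PREFIX.sub(lambda _m: repo_root, path, count=1)
+```
+
+Paths without a matching prefix (for example `/tmp/out`) pass through unchanged.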
+
+## Warm-start (Tier 2.5 — recommended once a 384 checkpoint exists)
+
+Cold-from-in22k is the dependency-free default. The curriculum finding is that chaining a
+hard change from a converged checkpoint beats cold-starting it. Once you have a converged
+SWIN-L 384 run, chain from it (keep `config_name`/`image_processor_name` on the 384 arch)
+and raise `augmentation.randaugment.magnitude` to 9:
+```bash
+CKPT=/projectnb/herbdl/workspaces/tgardos/herbdl/finetuning/output/SWIN/SWIN_L_384_CONCRETE_SEED0 \
+    SEEDS="1" bash submit_concrete.sh
+```
+
+## OOM / memory tuning
+
+SWIN-L @384 is heavy. If a job OOMs, lower the per-device batch and raise grad-accum to
+keep the effective batch (~128) constant, e.g. via `--set`:
+```bash
+--set training.per_device_train_batch_size=8 --set training.gradient_accumulation_steps=16
+```
+`gradient_checkpointing: true` is already on.
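+
+To pick a matching pair of values, the accumulation steps follow from the effective batch by simple division. A minimal sketch (`grad_accum_steps` is a hypothetical helper, not part of the trainer):
+
+```python
+def grad_accum_steps(effective_batch: int, per_device_batch: int, num_gpus: int = 1) -> int:
+    """Accumulation steps needed to keep the effective batch size constant."""
+    per_step = per_device_batch * num_gpus
+    if per_step <= 0 or effective_batch % per_step != 0:
+        raise ValueError("effective batch must be a positive multiple of per_device_batch * num_gpus")
+    return effective_batch // per_step
+```
+
+For the ~128 target on one GPU, a per-device batch of 8 gives 16 accumulation steps, and a per-device batch of 4 gives 32.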
+
+## Final prediction with TTA
+
+For the leaderboard/final eval, enable TTA on the trained checkpoint:
+```yaml
+multi_crop:
+  enabled: true
+  crop_sizes: [400, 416, 448, 480, 512]
+  target_size: 384
+  flip: true
+```
+The trainer runs `multi_crop_evaluate` after the standard eval and prints averaged
+accuracy + macro-F1 (`__CUSTOM__: Multi-crop eval ...`). Mirror the same crops/flip in
+`prediction.py` / `kaggle_submission.py` so the submission matches the eval.
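+
+When mirroring the TTA in the prediction scripts, the view enumeration and averaging amount to the following. This is a sketch under assumptions: `tta_views` and `average_logits` are hypothetical helpers operating on plain lists, and the actual `multi_crop_evaluate` is not shown here.
+
+```python
+from itertools import product
+
+
+def tta_views(crop_sizes, flip=True):
+    """Every (crop_size, flipped) pair that gets its own forward pass."""
+    flips = (False, True) if flip else (False,)
+    return list(product(crop_sizes, flips))
+
+
+def average_logits(per_view_logits):
+    """Element-wise mean of the logit vectors, one vector per view."""
+    n = len(per_view_logits)
+    return [sum(column) / n for column in zip(*per_view_logits)]
+```
+
+With the five crop sizes above and `flip: true`, each image is scored from 10 views before the argmax.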
+
+## Metrics
+
+Both top-1 accuracy and macro-F1 are reported every epoch (`eval_accuracy` /
+`eval_species_f1` for multi-task). Macro-F1 over the long tail is the number to watch
+(Tier 0).
+
+## Remote monitoring from phone / MacBook (Claude Code Remote Control)
+
+To babysit a run (check `qstat`, read logs, tweak configs) from an iPhone or MacBook, use
+Claude Code **Remote Control** — the `claude` process keeps running on the SCC login node
+(full `/projectnb` + `qsub` access), and your phone/browser are just remote windows into it.
+This is different from *Claude Code on the web*, whose cloud sandbox has **no** SCC access.
+
+### Updating Claude Code on SCC (needed: ≥ 2.1.51 for Remote Control)
+
+Claude Code here is installed as an npm **prefix** install and run via a shell alias:
+```bash
+alias claude='npx --prefix ~/claude-code claude'
+```
+Because of that, `claude update` does **not** work — it targets npm's global prefix, which
+is the read-only shared module dir (`/share/pkg.8/.../spring-2026-pyt`). Update the copy the
+alias actually uses instead:
+```bash
+module load miniconda && conda activate spring-2026-pyt   # for a consistent node/npm
+npm install --prefix ~/claude-code @anthropic-ai/claude-code@latest
+npx --prefix ~/claude-code claude --version               # confirm >= 2.1.51
+```
+Re-run that `npm install --prefix` line whenever you want to upgrade (don't use `claude update`).
+
+### Starting a Remote Control session
+
+Remote Control requires a **claude.ai subscription login (Pro/Max/Team/Enterprise) — API keys
+are not supported**. On the SCC login node:
+```bash
+unset ANTHROPIC_API_KEY          # if set, it blocks Remote Control
+claude /login                    # choose the claude.ai option (not a Console API key)
+
+tmux new -s claude-hpc           # persistent: survives SSH disconnects
+# inside tmux:
+cd /projectnb/herbdl/workspaces/tgardos/herbdl
+claude remote-control --name "HerbDL SWIN-L 384"
+```
+It prints a session URL and offers a QR code (press space). Detach with `Ctrl-b d`; Claude
+keeps running.
+
+- **iPhone:** Claude app → **Code** tab → pick "HerbDL SWIN-L 384" (or scan the QR).
+- **MacBook:** open the session URL, or go to **claude.ai/code** and pick the session. For a
+  local terminal instead: `ssh -t scc1.bu.edu "tmux attach -t claude-hpc"`.
+
+Notes:
+- Keep Claude on the **login node** (lightweight coordinator); GPU training stays in `qsub`
+  jobs on compute nodes. Don't run training directly under Claude.
+- Remote Control can **push a phone notification** when a long task finishes (enable via `/config`).
+- Text commands (`/context`, `/usage`) work from mobile; interactive pickers (`/resume`, `/mcp`)
+  only from the local terminal.
+
+## Deferred (next ensemble members)
+
+Per the chosen scope, these are intentionally **not** in this run and remain available to
+add later as additional ensemble members: domain-pretrained backbone swap (Tier 1.1
+timm/open_clip loader), warmed-up ArcFace rescue (Tier 2.4), class-balanced sampler /
+two-stage cRT (Tier 1.3-B), and +2021 data (Tier 2.8).
diff --git a/finetuning/SWIN/CURRICULUM_REPORT.md b/finetuning/SWIN/CURRICULUM_REPORT.md
@@ -0,0 +1,125 @@
+# Curriculum Learning — Stage-by-Stage Impact Report
+
+## Starting Point: SWIN_BASE_BASELINE
+
+**What it is:** SWIN-Base (224px, ImageNet-22k pretrained), fine-tuned with standard CE loss, no augmentation beyond basic resizing/normalization, unfrozen backbone from the start.
+
+**Result:** Peak F1 = **0.7454** @ epoch 47.8
+
+**Interpretation:** Solid starting point. Slow convergence curve — model starts at 0.58 F1 and takes ~48 epochs to plateau. This is the reference to beat.
+
+---
+
+## Interlude: Standalone Augmentation Test (SWIN_BASE_224_AUGMENTED)
+
+**What it added:** Heavy augmentation (RandAugment mag=9, Mixup α=0.8, CutMix α=1.0, RandomErasing 25%, label smoothing 0.1) applied directly from scratch — no warm-up, no curriculum.
+
+**Result:** Peak F1 = **0.6118** @ epoch 44.4, **worse than baseline by 13.4 points** (0.7454 → 0.6118)
+
+**Why it failed:** Throwing all regularization at a model cold is destructive. Strong Mixup/CutMix targets corrupt learning signal before the backbone has stabilized. The model oscillates and never recovers — note the flat 0.57–0.61 plateau from epoch 20–99. This is the key motivation for curriculum learning.
+
+---
+
+## Curriculum Stage 1 — Mild Augmentation Warm-up
+
+**What changed:** Initialized from baseline checkpoint. RandAugment mag=4 (mild), Mixup α=0.8, CutMix α=1.0, RandomErasing p=0.1, label smoothing 0.05. LR = 5e-5.
+
+**Result:** Peak F1 = **0.7214** @ epoch 23.9
+
+**Interpretation:** Starts immediately at 0.69 F1 (baseline already baked in), reaches 0.72 in 24 epochs. The mild augmentation + lower LR successfully builds on the baseline without disrupting it. Notably, this run converges faster than the baseline — 0.69 at epoch 3 vs. 0.58 for baseline.
+
+**Gain vs baseline at epoch 24:** +0.013 F1
+
+---
+
+## Curriculum Stage 2 — Medium Augmentation
+
+**What changed:** From S1 checkpoint. RandAugment mag=7 (stepped up), RandomErasing p=0.15, label smoothing 0.1. LR = 3e-5.
+
+**Result:** Peak F1 = **0.7421** @ epoch 27.3
+
+**Gain vs S1:** +0.021 F1
+
+**Interpretation:** The stepped-up augmentation is now helping rather than hurting, because the backbone is already warm. Model jumps to 0.72 at epoch 3 and climbs to 0.74 by epoch 27.
+
+---
+
+## Curriculum Stage 3 — Heavy Augmentation
+
+**What changed:** From S2 checkpoint. RandAugment mag=9 (full strength), RandomErasing p=0.25. LR = 2e-5. 50 epochs.
+
+**Result:** Peak F1 = **0.7510** @ epoch 41.0
+
+**Gain vs S2:** +0.009 F1. Diminishing returns beginning.
+
+**Interpretation:** Full augmentation now converges to a higher ceiling than baseline. However, the improvement margin is shrinking. The model starts at 0.74 immediately and creeps upward slowly — most gain is in early epochs, then it plateaus.
+
+---
+
+## Curriculum Stage 3-Cont — Extended Cosine Schedule
+
+**What changed:** From S3 final model (not best checkpoint). Fresh cosine LR schedule restart from 2e-5. Same augmentation. Intended to push past the S3 plateau.
+
+**Result:** Peak F1 = **0.7510** @ epoch 50.0
+
+**Gain vs S3:** **+0.000 F1**
+
+**Interpretation:** The LR restart did not help — S3 had already converged. The model stays in the same 0.74–0.75 band the entire 50 epochs. This suggests the 224px + CE + augmentation combination has hit its ceiling.
+
+---
+
+## Curriculum MultiTask — Auxiliary Family/Genus Heads
+
+**What changed:** From S3-Cont final model. Added CE auxiliary heads for family and genus (weights 0.2×family + 0.3×genus + 1.0×species). Mixup/CutMix retained. LR = 3e-4 (higher — new heads need to train). 100 epochs.
+
+**Result:** Peak F1 = **0.7523** @ epoch 68.3
+
+**Gain vs S3-Cont:** +0.001 F1 net, but with a very different trajectory.
+
+**Key observation:** The new family/genus heads start randomly initialized → eval_on_start near-zero → slow recovery through ~40 epochs before exceeding S3-Cont. MultiTask eventually pulls ahead but the improvement is modest. The multi-task signal is providing regularization but not a dramatic accuracy boost on its own.
+
+---
+
+## Curriculum ArcFace — SubCenter ArcFace Metric Learning
+
+**What changed:** From MultiTask checkpoint. Replaced CE species head with SubCenter ArcFace (embedding=512, scale=30, margin=0.5, k=3 sub-centers). Mixup/CutMix disabled (their blended soft targets are incompatible with ArcFace's hard-label margin). Hybrid CE weight = 0.0. LR = 1e-4. 60 epochs.
+
+**Result:** Peak F1 = **0.7376** @ epoch 58.1
+
+**Gain vs MultiTask:** **–0.015 F1** — a regression.
+
+**Interpretation:** ArcFace starts from near-zero (random embedding + weight matrix initialization), takes ~40 epochs just to get back near MultiTask's level, and peaks 1.5 points *below* the MultiTask checkpoint it started from. The loss function change required too many epochs to re-learn what CE had already learned. The 60-epoch budget was insufficient for ArcFace to amortize its warm-up cost and then improve further.
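+
+For readers unfamiliar with the head being swapped in, the SubCenter ArcFace logit computation can be sketched as below. This is a simplified illustration, not the project's `SwinWithArcFace` code: `subcenter_arcface_logits` is a hypothetical helper, it takes precomputed cosine similarities as plain lists, and it ignores the usual special handling for angles where θ + margin exceeds π.
+
+```python
+import math
+
+
+def subcenter_arcface_logits(cosines, target, scale=30.0, margin=0.5):
+    """cosines[c] holds the k cosine similarities between one embedding and class c's sub-centers."""
+    logits = []
+    for class_idx, sub_cosines in enumerate(cosines):
+        # Sub-center step: keep only the closest sub-center, clamped for acos.
+        cos_theta = max(-1.0, min(1.0, max(sub_cosines)))
+        if class_idx == target:
+            # Additive angular margin on the true class only: cos(theta + m).
+            cos_theta = math.cos(math.acos(cos_theta) + margin)
+        logits.append(scale * cos_theta)
+    return logits
+```
+
+The margin lowers the target-class logit during training, so the embedding has to move closer to its class center to reach the same loss. Because the embedding layer and class centers are new, they start from random initialization.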
+
+---
+
+## Summary Table
+
+| Stage | Technique Added | Peak F1 | Δ vs Previous | Epochs to Peak |
+|-------|----------------|---------|--------------|----------------|
+| Baseline | CE, no augmentation | 0.7454 | — | 47.8 |
+| Aug (standalone) | Heavy aug, no curriculum | 0.6118 | –0.134 | 44.4 |
+| S1 | Mild aug (warm-up) | 0.7214 | –0.024* | 23.9 |
+| S2 | Medium aug | 0.7421 | +0.021 | 27.3 |
+| S3 | Heavy aug | 0.7510 | +0.009 | 41.0 |
+| S3-Cont | LR restart | 0.7510 | +0.000 | 50.0 |
+| MultiTask | Family/genus aux heads | 0.7523 | +0.001 | 68.3 |
+| ArcFace | Metric learning loss | 0.7376 | **–0.015** | 58.1 |
+
+\* S1 peaks below baseline because it ran for fewer epochs (25 vs. 48 for baseline). Chaining S1→S2→S3 ultimately exceeds the baseline ceiling (0.751 vs. 0.745).
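+
+The Δ column can be recomputed from the peak values. The sketch below assumes that "previous" means the checkpoint each stage was initialized from, with the baseline serving as the reference for both the standalone augmentation run and S1. The dictionary names are illustrative.
+
+```python
+# Peak F1 per stage, copied from the table above.
+PEAK_F1 = {
+    "Baseline": 0.7454, "Aug (standalone)": 0.6118, "S1": 0.7214, "S2": 0.7421,
+    "S3": 0.7510, "S3-Cont": 0.7510, "MultiTask": 0.7523, "ArcFace": 0.7376,
+}
+
+# Reference stage for each delta (assumption: the stage it was compared against).
+PARENT = {
+    "Aug (standalone)": "Baseline", "S1": "Baseline", "S2": "S1", "S3": "S2",
+    "S3-Cont": "S3", "MultiTask": "S3-Cont", "ArcFace": "MultiTask",
+}
+
+
+def delta(stage: str) -> float:
+    """Peak-F1 change of a stage relative to its reference stage, to 3 decimals."""
+    return round(PEAK_F1[stage] - PEAK_F1[PARENT[stage]], 3)
+```
+
+For example, `delta("S2")` evaluates to 0.021 and `delta("ArcFace")` to -0.015.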
+
+---
+
+## Key Takeaways
+
+1. **Curriculum ordering matters critically.** Applying heavy augmentation cold destroyed performance (0.61). Applied progressively, it exceeds baseline (0.751 vs. 0.745).
+
+2. **The aug curriculum plateau is around 0.750–0.752.** S3, S3-Cont, and MultiTask all peak in this band. The 224px CE model appears structurally capped here.
+
+3. **MultiTask gave only marginal gain (+0.001).** The auxiliary signal helps slightly but the species task already dominates. More useful as regularization than as a direct accuracy booster.
+
+4. **ArcFace regressed.** The 60-epoch budget was too short — ArcFace requires a long cold-start recovery period before it can outperform CE. The hybrid/384 stages queued after it will inherit this disadvantage.
+
+5. **The gap to 0.80 is still ~5 points.** The most promising levers remaining are:
+   - **384px resolution**: higher input resolution preserves finer detail, which is known to help fine-grained recognition
+   - **SWIN V2 architecture**: updated relative position bias and scaled cosine attention
+   - **Revisiting ArcFace** with a longer budget or frozen-backbone warm-up phase