260 changes: 260 additions & 0 deletions finetuning/SWIN/CONCRETE_RUN_README.md
@@ -0,0 +1,260 @@
# Concrete next run — SWIN-L 384 (Tier 1 & 2)

Implements the "Putting it together — a concrete next run" recipe from
`SWIN_training_setup_summary.md` (§10 recommendations, §11 step-by-step), as a single
strong single-model candidate plus the code features it needs.

## What this run is

| Lever | Choice | Section |
|-------|--------|---------|
| Backbone | `microsoft/swin-large-patch4-window12-384-in22k` (stay in SWIN) | 1.1 / 1.2 |
| Resolution | 384 (processor-driven, no resize code) | 1.2 |
| Loss | balanced-softmax CE + multi-task family/genus/species heads | 1.3-A |
| Schedule | 100 epochs, 5% warmup, cosine, **EMA on** | 2.6 |
| Augmentation | MEDIUM (RandAug mag 7, mild mixup/erasing) — cold-start safe | 2.5 |
| Inference | multi-crop + flip TTA wired, enabled only for final prediction | 2.7 |

Config: [`configs_advanced/swin_large_384_concrete.yml`](configs_advanced/swin_large_384_concrete.yml)
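The Schedule row above can be read as a multiplier on the base learning rate. A minimal sketch, assuming linear warmup from zero over the first 5% of steps followed by a single half-cosine decay to zero (a common convention; the scheduler the trainer actually builds is not shown here). `lr_multiplier` and the step count are illustrative names and values, not taken from the code.

```python
import math

def lr_multiplier(step: int, total_steps: int, warmup_frac: float = 0.05) -> float:
    """Multiplier applied to the base learning rate at a given step.

    Assumption: linear warmup from 0, then half-cosine decay to 0.
    """
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

total = 10_000  # hypothetical number of optimizer steps in the run
for s in (0, 250, 500, 5_250, 10_000):
    print(s, round(lr_multiplier(s, total), 4))
```

The multiplier peaks exactly when warmup ends and is halfway down at the midpoint of the remaining steps.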

## Code features added to `SWIN_finetuning_advanced.py`

All are config-gated and default **off**, so existing configs behave exactly as before.

1. **Balanced softmax / logit adjustment (Tier 1.3-A)** — new `long_tail` section.
A per-class `log_prior` (log training frequency, in the species/CE head's index
space) is added to the species logits **during training only** (`logits + tau*log_prior`),
then plain argmax at inference. Down-weights head classes to lift macro-F1 on the long
tail. Applied in `MixupTrainer.compute_loss` for single-task and multi-task, in both the
mixup and non-mixup paths. Not applied to ArcFace.
```yaml
long_tail:
  logit_adjustment: true
  tau: 1.0   # strength; 1.0 = standard balanced softmax
```

2. **Weight EMA (Tier 2.6)** — new `ema` section + `EMACallback`.
Maintains a shadow average of the parameters (`shadow = decay*shadow + (1-decay)*param`
every step) and copies it into the model at train end, so the final `evaluate()` and
`save_model()` reflect EMA weights. **Keep `load_best_model_at_end: false`** — the
best-checkpoint reload would otherwise be overwritten by the EMA copy.
```yaml
ema:
  enabled: true
  decay: 0.9998
```

3. **Horizontal-flip TTA (Tier 2.7)** — `multi_crop.flip`.
`build_multi_crop_transforms(..., flip=True)` also emits a flipped variant of each crop,
so logits average over crops × {orig, flip}. Leave `multi_crop.enabled: false` during
training; enable it for the final/leaderboard prediction only.

4. **Gradient-checkpointing passthrough** — `MultiTaskSwinModel` / `SwinWithArcFace` now
forward `gradient_checkpointing_enable/disable` to the backbone, so
`training.gradient_checkpointing: true` works for the wrapped models (needed to fit
SWIN-L @384 on one GPU).
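Features 1 and 2 each come down to a few lines of arithmetic. The sketch below is a pure-Python illustration of the two formulas quoted above (`logits + tau*log_prior` and `shadow = decay*shadow + (1-decay)*param`), not the code in `SWIN_finetuning_advanced.py`. The function names and the 3-class toy counts are assumptions made for the example.

```python
import math

def log_prior(class_counts):
    """Log training frequency per class."""
    total = sum(class_counts)
    return [math.log(c / total) for c in class_counts]

def cross_entropy(logits, target):
    """Softmax cross-entropy for one example (log-sum-exp stabilised)."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return lse - logits[target]

def balanced_softmax_loss(logits, target, prior, tau=1.0):
    """Training loss: add tau * log_prior to the logits, then plain CE."""
    adjusted = [z + tau * p for z, p in zip(logits, prior)]
    return cross_entropy(adjusted, target)

def ema_update(shadow, params, decay=0.9998):
    """One EMA step: shadow = decay*shadow + (1-decay)*param."""
    return [decay * s + (1.0 - decay) * p for s, p in zip(shadow, params)]

# Hypothetical 3-class problem: one head class and two tail classes.
prior = log_prior([900, 90, 10])
logits = [2.0, 2.0, 2.0]  # the model is undecided between all classes

# With equal raw logits, a tail-class example incurs a much larger loss than a
# head-class example, so training pushes the tail logits up.
loss_head = balanced_softmax_loss(logits, target=0, prior=prior)
loss_tail = balanced_softmax_loss(logits, target=2, prior=prior)
print(round(loss_head, 4), round(loss_tail, 4))

# Inference uses the raw logits: plain argmax, no prior added.
pred = max(range(len(logits)), key=lambda i: logits[i])

# EMA: the shadow weight moves slowly toward the live weight.
shadow = [0.0]
for _ in range(1000):
    shadow = ema_update(shadow, [1.0])
print(round(shadow[0], 4))
```

With `decay: 0.9998` the shadow has covered less than a fifth of the distance to the live weight after 1000 steps, which is why the EMA copy only becomes meaningful over a long schedule.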

## Environment setup (one-time)

Jobs run via `train_advanced.sh`, which loads:

```bash
module load miniconda
module load academic-ml/spring-2026
conda activate spring-2026-pyt
```

`spring-2026-pyt` already provides torch 2.9.1, transformers 4.57.3 (≥4.52, required),
datasets, accelerate, safetensors, torchvision, scikit-learn, pillow, pyyaml, numpy. Two
packages it does **not** include are needed by the trainer — install them once into your
user-site:

```bash
module load miniconda && conda activate spring-2026-pyt
pip install --user evaluate wandb
```

Notes:
- `evaluate` is required (accuracy / macro-F1); `wandb` is needed because the configs use
`report_to: wandb`. Pass `--set training.report_to=none` (or set `wandb.enabled: false`) to skip W&B.
- If `import wandb` fails with `cannot import name 'validate_core_schema' from 'pydantic_core'`,
the `--user` install shadowed the env's `pydantic_core`. Remove the duplicate so the env's
copy is used again:
`rm -rf ~/.local/lib/python3.12/site-packages/pydantic_core ~/.local/lib/python3.12/site-packages/pydantic_core-*.dist-info`
- `evaluate.load(...)` downloads its metric script from the HF hub on first use and caches it
under `~/.cache/huggingface`. Run the smoke test (below) once from a login node to warm the
cache if your compute nodes can't reach the hub.
- The PyTorch env requires `gpu_c >= 7.0`; the submit scripts request `gpu_c=8.0` (A100), so OK.

Sanity check the env:
```bash
python -c "import torch, transformers, datasets, evaluate, wandb; print('env OK')"
```

### Weights & Biases (logs to gardoslab / herbdl)

The trainer calls `wandb.init(entity="gardoslab", project="herbdl", name=run_name,
group=run_group, id=run_id, ...)` straight from the config (see
`SWIN_finetuning_advanced.py`), so no code change is needed — you only need team membership
+ a valid API key.

1. **Be a member of the `gardoslab` team.** Open <https://wandb.ai/gardoslab> while signed in.
If you can't see it, ask the team owner to invite your W&B username. `entity="gardoslab"`
fails with a permission error until you're a member — being logged in is not enough.

2. **Authenticate on SCC** (login node; `~/.netrc` is shared, so compute-node jobs reuse it —
no per-job login). Grab your key from <https://wandb.ai/authorize>:
```bash
module load miniconda && conda activate spring-2026-pyt
wandb login --relogin # paste key; --relogin replaces a stale key
```

3. **Verify** (the stored key can be stale even though `~/.netrc` exists):
```bash
wandb login --verify
python -c "import wandb; v=wandb.Api().viewer; print(v.username, '| teams:', v.teams)"
```
`gardoslab` should appear in `teams`.

Notes:
- Alternative to `~/.netrc`: `export WANDB_API_KEY=<key>` in your shell profile (keeps the
key out of any committed script).
- The seed loop in `submit_concrete.sh` sets a distinct `run_id`/`run_name` per seed, so seeds
appear as separate runs grouped under `SWIN_L_384_Concrete`.
- To skip W&B for a run: `--set training.report_to=none` (or `wandb.enabled: false`).
- If a compute node can't reach W&B: `export WANDB_MODE=offline`, then `wandb sync <run_dir>` later.

## How to launch (you run this — nothing is auto-submitted)

Single run (seed 0):
```bash
cd finetuning/SWIN
SEEDS="0" bash submit_concrete.sh
```

3- or 5-seed ensemble:
```bash
bash submit_concrete.sh # seeds 0 1 2
SEEDS="0 1 2 3 4" bash submit_concrete.sh
```

Each job requests 1 A100-80G GPU on `herbdl` for 48h and writes to
`finetuning/output/SWIN/SWIN_L_384_CONCRETE_SEED<seed>/`. Adjust the `-M` email in
`submit_concrete.sh` if needed.

### Smoke test first (recommended)
Verify the pipeline end-to-end cheaply before committing 48h jobs:
```bash
qsub -l h_rt=2:00:00 -pe omp 8 -P herbdl -l gpus=1 -l gpu_c=8.0 -l gpu_memory=80G \
-N SWINL384_SMOKE \
-v CONFIG_FILE=configs_advanced/swin_large_384_concrete.yml \
-v SET_ARGS="--set data.max_train_samples=2000 --set data.max_eval_samples=2000 --set training.num_train_epochs=1 --set training.output_dir=/projectnb/herbdl/workspaces/tgardos/herbdl/finetuning/output/SWIN/SMOKE --set training.overwrite_output_dir=true --set wandb.enabled=false" \
train_advanced.sh
```

## Output paths auto-relocate to your workspace

Most configs in this repo (inherited from faridkar's checkout) hardcode `output_dir`/`logging_dir`
under `/projectnb/herbdl/workspaces/faridkar/herbdl/...`. The trainer rewrites any
`.../workspaces/<author>/herbdl` prefix to the repo you actually run from, preserving the
trailing run name — so a `tgardos` checkout writes to
`/projectnb/herbdl/workspaces/tgardos/herbdl/finetuning/output/SWIN/<NAME>` automatically,
with no YAML edits. It logs the rewrite (`__CUSTOM__: Relocated output path ...`). Set
`HERBDL_NO_RELOCATE=1` to disable (e.g. to write somewhere else via an explicit path).
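The rewrite described above can be pictured as a single regex substitution. This is a hedged sketch of the idea, not the trainer's implementation: `relocate_output_path` and `_PREFIX` are hypothetical names, and only the `HERBDL_NO_RELOCATE` variable and the path layout come from this README.

```python
import os
import re

# Matches everything up to and including .../workspaces/<author>/herbdl
_PREFIX = re.compile(r"^.*?/workspaces/[^/]+/herbdl(?=/|$)")

def relocate_output_path(path: str, repo_root: str) -> str:
    """Swap the author's checkout prefix for repo_root, keeping the tail."""
    if os.environ.get("HERBDL_NO_RELOCATE") == "1":
        return path  # relocation disabled: use the path exactly as written
    # A callable replacement avoids backslash handling in repo_root.
    return _PREFIX.sub(lambda m: repo_root.rstrip("/"), path, count=1)

old = "/projectnb/herbdl/workspaces/faridkar/herbdl/finetuning/output/SWIN/MY_RUN"
print(relocate_output_path(old, "/projectnb/herbdl/workspaces/tgardos/herbdl"))
```

Paths that do not contain a `workspaces/<author>/herbdl` prefix pass through unchanged in this sketch.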

## Warm-start (Tier 2.5 — recommended once a 384 checkpoint exists)

Cold-from-in22k is the dependency-free default. The curriculum finding is that chaining a
hard change from a converged checkpoint beats cold-starting it. Once you have a converged
SWIN-L 384 run, chain from it (keep `config_name`/`image_processor_name` on the 384 arch)
and raise `augmentation.randaugment.magnitude` to 9:
```bash
CKPT=/projectnb/herbdl/workspaces/tgardos/herbdl/finetuning/output/SWIN/SWIN_L_384_CONCRETE_SEED0 \
SEEDS="1" bash submit_concrete.sh
```

## OOM / memory tuning

SWIN-L @384 is heavy. If a job OOMs, lower the per-device batch and raise grad-accum to
keep the effective batch (~128) constant, e.g. via `--set`:
```
--set training.per_device_train_batch_size=8 --set training.gradient_accumulation_steps=16
```
`gradient_checkpointing: true` is already on.
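The trade-off is plain arithmetic: effective batch = per-device batch × grad-accum steps × number of GPUs. A small helper (hypothetical, not part of the repo) for picking the matching `gradient_accumulation_steps`:

```python
def grad_accum_steps(effective_batch: int, per_device_batch: int, num_gpus: int = 1) -> int:
    """Accumulation steps needed to keep the effective batch size constant."""
    per_step = per_device_batch * num_gpus
    if effective_batch % per_step != 0:
        raise ValueError(
            f"effective batch {effective_batch} is not a multiple of {per_step}"
        )
    return effective_batch // per_step

# Halving the per-device batch doubles the accumulation steps.
for bs in (16, 8, 4):
    print(bs, grad_accum_steps(128, bs))
```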

## Final prediction with TTA

For the leaderboard/final eval, enable TTA on the trained checkpoint:
```yaml
multi_crop:
  enabled: true
  crop_sizes: [400, 416, 448, 480, 512]
  target_size: 384
  flip: true
```
The trainer runs `multi_crop_evaluate` after the standard eval and prints averaged
accuracy + macro-F1 (`__CUSTOM__: Multi-crop eval ...`). Mirror the same crops/flip in
`prediction.py` / `kaggle_submission.py` so the submission matches the eval.
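When mirroring the TTA in another script, the core step is averaging logits over every crop and its mirror image. A toy sketch of that averaging, assuming images are plain 2-D lists and using a stand-in `toy_model`; how `crop_sizes` and `target_size` produce the crops is not covered here, and none of these names come from the repo.

```python
def hflip(image):
    """Mirror a 2-D image (list of rows) left to right."""
    return [row[::-1] for row in image]

def tta_logits(model, crops, flip=True):
    """Average model logits over every crop and, optionally, its mirror."""
    views = []
    for crop in crops:
        views.append(crop)
        if flip:
            views.append(hflip(crop))
    outputs = [model(v) for v in views]
    n_classes = len(outputs[0])
    return [sum(o[c] for o in outputs) / len(outputs) for c in range(n_classes)]

def toy_model(image):
    """Stand-in network: two 'logits' from the left and right column sums."""
    left = sum(row[0] for row in image)
    right = sum(row[-1] for row in image)
    return [float(left), float(right)]

crops = [[[1, 0], [1, 0]], [[3, 0], [1, 0]]]
print(tta_logits(toy_model, crops, flip=False))
print(tta_logits(toy_model, crops, flip=True))
```

The number of forward passes is `len(crops)` without flips and twice that with them, so five crop sizes plus flip means ten passes per image.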

## Metrics

Both top-1 accuracy and macro-F1 are reported every epoch (`eval_accuracy` /
`eval_species_f1` for multi-task). Macro-F1 over the long tail is the number to watch
(Tier 0).
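A toy example of why macro-F1 is the more honest number on a long-tailed label set: a model that never predicts the tail class still scores high accuracy. This is a from-scratch illustration with made-up labels; the trainer itself computes its metrics through `evaluate`.

```python
def accuracy(y_true, y_pred):
    """Fraction of examples predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so every class counts equally."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# 8 head-class examples, 2 tail-class examples; the model always predicts head.
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 10
print(accuracy(y_true, y_pred), round(macro_f1(y_true, y_pred), 4))
```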

## Remote monitoring from phone / MacBook (Claude Code Remote Control)

To babysit a run (check `qstat`, read logs, tweak configs) from an iPhone or MacBook, use
Claude Code **Remote Control** — the `claude` process keeps running on the SCC login node
(full `/projectnb` + `qsub` access), and your phone/browser are just remote windows into it.
This is different from *Claude Code on the web*, whose cloud sandbox has **no** SCC access.

### Updating Claude Code on SCC (needed: ≥ 2.1.51 for Remote Control)

Claude Code here is installed as an npm **prefix** install and run via a shell alias:
```bash
alias claude='npx --prefix ~/claude-code claude'
```
Because of that, `claude update` does **not** work — it targets npm's global prefix, which
is the read-only shared module dir (`/share/pkg.8/.../spring-2026-pyt`). Update the copy the
alias actually uses instead:
```bash
module load miniconda && conda activate spring-2026-pyt # for a consistent node/npm
npm install --prefix ~/claude-code @anthropic-ai/claude-code@latest
npx --prefix ~/claude-code claude --version # confirm >= 2.1.51
```
Re-run that `npm install --prefix` line whenever you want to upgrade (don't use `claude update`).

### Starting a Remote Control session

Remote Control requires a **claude.ai subscription login (Pro/Max/Team/Enterprise) — API keys
are not supported**. On the SCC login node:
```bash
unset ANTHROPIC_API_KEY # if set, it blocks Remote Control
claude /login # choose the claude.ai option (not a Console API key)

tmux new -s claude-hpc # persistent: survives SSH disconnects
# inside tmux:
cd /projectnb/herbdl/workspaces/tgardos/herbdl
claude remote-control --name "HerbDL SWIN-L 384"
```
It prints a session URL and offers a QR code (press space). Detach with `Ctrl-b d`; Claude
keeps running.

- **iPhone:** Claude app → **Code** tab → pick "HerbDL SWIN-L 384" (or scan the QR).
- **MacBook:** open the session URL, or go to **claude.ai/code** and pick the session. For a
local terminal instead: `ssh -t scc1.bu.edu "tmux attach -t claude-hpc"`.

Notes:
- Keep Claude on the **login node** (lightweight coordinator); GPU training stays in `qsub`
jobs on compute nodes. Don't run training directly under Claude.
- Remote Control can **push a phone notification** when a long task finishes (enable via `/config`).
- Text commands (`/context`, `/usage`) work from mobile; interactive pickers (`/resume`, `/mcp`)
only from the local terminal.

## Deferred (next ensemble members)

Per the chosen scope, these are intentionally **not** in this run and remain available to
add later as additional ensemble members: domain-pretrained backbone swap (Tier 1.1
timm/open_clip loader), warmed-up ArcFace rescue (Tier 2.4), class-balanced sampler /
two-stage cRT (Tier 1.3-B), and +2021 data (Tier 2.8).
125 changes: 125 additions & 0 deletions finetuning/SWIN/CURRICULUM_REPORT.md
@@ -0,0 +1,125 @@
# Curriculum Learning — Stage-by-Stage Impact Report

## Starting Point: SWIN_BASE_BASELINE

**What it is:** SWIN-Base (224px, ImageNet-22k pretrained), fine-tuned with standard CE loss, no augmentation beyond basic resizing/normalization, unfrozen backbone from the start.

**Result:** Peak F1 = **0.7454** @ epoch 47.8

**Interpretation:** Solid starting point. Slow convergence curve — model starts at 0.58 F1 and takes ~48 epochs to plateau. This is the reference to beat.

---

## Interlude: Standalone Augmentation Test (SWIN_BASE_224_AUGMENTED)

**What it added:** Heavy augmentation (RandAugment mag=9, Mixup α=0.8, CutMix α=1.0, RandomErasing 25%, label smoothing 0.1) applied directly from scratch — no warm-up, no curriculum.

**Result:** Peak F1 = **0.6118** @ epoch 44.4, **worse than baseline by 13.4 points**

**Why it failed:** Throwing all regularization at a model cold is destructive. Strong Mixup/CutMix targets corrupt learning signal before the backbone has stabilized. The model oscillates and never recovers — note the flat 0.57–0.61 plateau from epoch 20–99. This is the key motivation for curriculum learning.

---

## Curriculum Stage 1 — Mild Augmentation Warm-up

**What changed:** Initialized from baseline checkpoint. RandAugment mag=4 (mild), Mixup α=0.8, CutMix α=1.0, RandomErasing p=0.1, label smoothing 0.05. LR = 5e-5.

**Result:** Peak F1 = **0.7214** @ epoch 23.9

**Interpretation:** Starts immediately at 0.69 F1 (baseline already baked in), reaches 0.72 in 24 epochs. The mild augmentation + lower LR successfully builds on the baseline without disrupting it. Notably, this run converges faster than the baseline — 0.69 at epoch 3 vs. 0.58 for baseline.

**Gain vs baseline at epoch 24:** +0.013 F1

---

## Curriculum Stage 2 — Medium Augmentation

**What changed:** From S1 checkpoint. RandAugment mag=7 (stepped up), RandomErasing p=0.15, label smoothing 0.1. LR = 3e-5.

**Result:** Peak F1 = **0.7421** @ epoch 27.3

**Gain vs S1:** +0.021 F1

**Interpretation:** The stepped-up augmentation is now helping rather than hurting, because the backbone is already warm. Model jumps to 0.72 at epoch 3 and climbs to 0.74 by epoch 27.

---

## Curriculum Stage 3 — Heavy Augmentation

**What changed:** From S2 checkpoint. RandAugment mag=9 (full strength), RandomErasing p=0.25. LR = 2e-5. 50 epochs.

**Result:** Peak F1 = **0.7510** @ epoch 41.0

**Gain vs S2:** +0.009 F1. Diminishing returns beginning.

**Interpretation:** Full augmentation now converges to a higher ceiling than baseline. However, the improvement margin is shrinking. The model starts at 0.74 immediately and creeps upward slowly — most gain is in early epochs, then it plateaus.

---

## Curriculum Stage 3-Cont — Extended Cosine Schedule

**What changed:** From S3 final model (not best checkpoint). Fresh cosine LR schedule restart from 2e-5. Same augmentation. Intended to push past the S3 plateau.

**Result:** Peak F1 = **0.7510** @ epoch 50.0

**Gain vs S3:** **+0.000 F1**

**Interpretation:** The LR restart did not help — S3 had already converged. The model stays in the same 0.74–0.75 band the entire 50 epochs. This suggests the 224px + CE + augmentation combination has hit its ceiling.

---

## Curriculum MultiTask — Auxiliary Family/Genus Heads

**What changed:** From S3-Cont final model. Added CE auxiliary heads for family and genus (weights 0.2×family + 0.3×genus + 1.0×species). Mixup/CutMix retained. LR = 3e-4 (higher — new heads need to train). 100 epochs.

**Result:** Peak F1 = **0.7523** @ epoch 68.3

**Gain vs S3-Cont:** +0.001 F1 net, but with a very different trajectory.

**Key observation:** The new family/genus heads start randomly initialized → eval_on_start near-zero → slow recovery through ~40 epochs before exceeding S3-Cont. MultiTask eventually pulls ahead but the improvement is modest. The multi-task signal is providing regularization but not a dramatic accuracy boost on its own.

---

## Curriculum ArcFace — SubCenter ArcFace Metric Learning

**What changed:** From MultiTask checkpoint. Replaced CE species head with SubCenter ArcFace (embedding=512, scale=30, margin=0.5, k=3 sub-centers). Mixup/CutMix disabled (their mixed soft labels are incompatible with the hard labels the angular margin needs). Hybrid CE weight = 0.0. LR = 1e-4. 60 epochs.

**Result:** Peak F1 = **0.7376** @ epoch 58.1

**Gain vs MultiTask:** **–0.015 F1** — a regression.

**Interpretation:** ArcFace starts from near-zero (random embedding + weight matrix initialization), takes ~40 epochs just to get back near MultiTask's level, and peaks 1.5 points *below* the MultiTask checkpoint it started from. The loss function change required too many epochs to re-learn what CE had already learned. The 60-epoch budget was insufficient for ArcFace to amortize its warm-up cost and then improve further.
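For readers unfamiliar with the head that was swapped in, the logit computation of a sub-center ArcFace layer can be sketched in pure Python. This is a simplified illustration under stated assumptions, not the repo's `SwinWithArcFace`: the function names and the 2-D toy vectors are hypothetical, and the usual safeguard for angles where `theta + margin` exceeds π is omitted.

```python
import math

def _unit(v):
    """L2-normalise a vector."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def subcenter_arcface_logits(embedding, centers, target, scale=30.0, margin=0.5):
    """centers[j] holds the k sub-center vectors of class j."""
    e = _unit(embedding)
    logits = []
    for j, subs in enumerate(centers):
        # Sub-center pooling: keep the best-matching sub-center per class.
        cos = max(sum(a * b for a, b in zip(e, _unit(w))) for w in subs)
        if j == target:
            # Additive angular margin, applied to the labelled class only.
            theta = math.acos(max(-1.0, min(1.0, cos)))
            cos = math.cos(theta + margin)
        logits.append(scale * cos)
    return logits

# Two classes with k=3 sub-centers each, in a toy 2-D embedding space.
centers = [
    [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]],   # class 0
    [[0.0, 1.0], [0.0, -1.0], [-1.0, 0.0]],  # class 1
]
logits = subcenter_arcface_logits([2.0, 0.0], centers, target=0)
print([round(z, 3) for z in logits])
```

Because the margin is attached to one labelled class index, the loss needs a single hard label per example. The head's class weights are also new parameters with no counterpart in the CE head, which is consistent with the near-zero start described above.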

---

## Summary Table

| Stage | Technique Added | Peak F1 | Δ vs Previous | Epochs to Peak |
|-------|----------------|---------|--------------|----------------|
| Baseline | CE, no augmentation | 0.7454 | — | 47.8 |
| Aug (standalone) | Heavy aug, no curriculum | 0.6118 | –0.134 | 44.4 |
| S1 | Mild aug (warm-up) | 0.7214 | –0.024* | 23.9 |
| S2 | Medium aug | 0.7421 | +0.021 | 27.3 |
| S3 | Heavy aug | 0.7510 | +0.009 | 41.0 |
| S3-Cont | LR restart | 0.7510 | +0.000 | 50.0 |
| MultiTask | Family/genus aux heads | 0.7523 | +0.001 | 68.3 |
| ArcFace | Metric learning loss | 0.7376 | **–0.015** | 58.1 |

\* S1 peaks below baseline because it ran for fewer epochs (25 vs. 48 for baseline). Chaining S1→S2→S3 ultimately exceeds the baseline ceiling (0.751 vs. 0.745).
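The Δ column can be recomputed from the peak F1 values alone. A short check, assuming that "Previous" means the baseline for both the standalone augmentation run and S1, and the preceding chained stage for everything else:

```python
peak_f1 = {
    "Baseline": 0.7454,
    "Aug (standalone)": 0.6118,
    "S1": 0.7214,
    "S2": 0.7421,
    "S3": 0.7510,
    "S3-Cont": 0.7510,
    "MultiTask": 0.7523,
    "ArcFace": 0.7376,
}

# Which stage each row is compared against.
previous = {
    "Aug (standalone)": "Baseline",
    "S1": "Baseline",
    "S2": "S1",
    "S3": "S2",
    "S3-Cont": "S3",
    "MultiTask": "S3-Cont",
    "ArcFace": "MultiTask",
}

deltas = {stage: peak_f1[stage] - peak_f1[prev] for stage, prev in previous.items()}
for stage, delta in deltas.items():
    print(f"{stage:<17} {delta:+.3f}")
```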

---

## Key Takeaways

1. **Curriculum ordering matters critically.** Applying heavy augmentation cold destroyed performance (0.61). Applied progressively, it exceeds baseline (0.751 vs. 0.745).

2. **The aug curriculum plateau is around 0.750–0.752.** S3, S3-Cont, and MultiTask all peak in this band. The 224px CE model appears structurally capped here.

3. **MultiTask gave only marginal gain (+0.001).** The auxiliary signal helps slightly but the species task already dominates. More useful as regularization than as a direct accuracy booster.

4. **ArcFace regressed.** The 60-epoch budget was too short — ArcFace requires a long cold-start recovery period before it can outperform CE. The hybrid/384 stages queued after it will inherit this disadvantage.

5. **The gap to 0.80 is still ~5 points.** The most promising levers remaining are:
- **384px resolution** gives the model more pixels per specimen, and finer detail is known to help fine-grained recognition
- **SWIN V2 architecture** — updated relative position bias and scaled cosine attention
- **Revisiting ArcFace** with a longer budget or frozen-backbone warm-up phase