SWIN finetuning: domain-gap closing Tier 1/2 (balanced softmax, EMA, flip TTA) + SWIN-L 384 concrete run by trgardos · Pull Request #36 · gardoslab/herbdl

trgardos · 2026-06-04T16:34:37Z

Summary

Implements the Tier 1 & Tier 2 "concrete next run" from SWIN_training_setup_summary.md (§10 recommendations / §11 step-by-step) on top of faridkar's SWIN finetuning framework, scoped to stay within SWIN-L 384 (no backbone swap). Goal: close the macro-F1 gap on the ~15.5k-class long tail.

The branch is 5 commits, structured so the framework sync is isolated from the new work:

34045dd Sync framework from faridkar — brings this repo's finetuning/SWIN/ up to faridkar's current trainer (integrated ArcFace + non-backbone overlay, multi-task, mixup/cutmix, multi-crop TTA), plus configs_advanced/ (curriculum + seed ensemble), hyperparameter_configs/, launch_sweep.py, and reference docs. No tgardos edits in this commit — verbatim, so the rest of the PR is a clean diff.
c20acfa §10/§11 Tier 1–2 features + concrete config (see below).
956f6b5 Environment — use the spring-2026-pyt conda env (not faridkar's personal herb_env); document pip install --user evaluate wandb.
e1f946c W&B setup docs — gardoslab/herbdl logging.
6bd0636 Output-path auto-relocation + TrainingArguments fix.

New trainer features (all config-gated, default-off — existing configs unchanged)

| Feature | Section | Config key |
| --- | --- | --- |
| Balanced softmax / logit adjustment (species log-prior added to logits at train time only) | 1.3-A | `long_tail.logit_adjustment` / `tau` |
| Weight EMA (shadow avg copied into model at train end → final eval/save use EMA) | 2.6 | `ema.enabled` / `decay` |
| Horizontal-flip TTA (avg over crops × {orig, flip}) | 2.7 | `multi_crop.flip` |
| `gradient_checkpointing_enable/disable` passthrough on the multi-task/ArcFace wrappers | - | `training.gradient_checkpointing` |
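
As a rough illustration of the weight-EMA row, the sketch below keeps a shadow average of the parameters and copies it back at the end of training. It is a toy on plain floats, not the PR's `EMACallback`; the class name `WeightEMA` and its methods are assumptions made for this example, and `decay=0.5` is chosen only so the arithmetic is easy to follow.

```python
class WeightEMA:
    """Toy exponential moving average over a dict of float parameters."""

    def __init__(self, params, decay=0.999):
        self.decay = decay
        # Shadow copy starts at the current parameter values.
        self.shadow = dict(params)

    def update(self, params):
        # shadow <- decay * shadow + (1 - decay) * current
        d = self.decay
        for name, value in params.items():
            self.shadow[name] = d * self.shadow[name] + (1.0 - d) * value

    def copy_to(self, params):
        # At train end, overwrite the live weights with the shadow average
        # so that final eval/save see the EMA weights.
        params.update(self.shadow)


params = {"w": 0.0}
ema = WeightEMA(params, decay=0.5)
for new_value in (1.0, 1.0, 1.0):
    params["w"] = new_value   # stand-in for an optimizer step
    ema.update(params)        # shadow: 0.5, then 0.75, then 0.875
ema.copy_to(params)
```

A real run would use a decay much closer to 1 so the shadow weights change slowly relative to the per-step updates.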

Balanced softmax is applied in MixupTrainer.compute_loss for single-task and multi-task, in both the mixup and non-mixup paths, and is off at inference (plain argmax) — the standard balanced-softmax recipe for lifting macro-F1 on rare classes.
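
A minimal sketch of the idea, assuming per-class training counts are available: the log-prior (scaled by `tau`) is added to the logits only inside the training loss, and prediction uses the raw logits. This is pure Python for one example, not the code in `MixupTrainer.compute_loss`; all function names here are hypothetical.

```python
import math


def log_prior(class_counts):
    """Per-class log-prior estimated from training-set counts."""
    total = sum(class_counts)
    return [math.log(c / total) for c in class_counts]


def cross_entropy(logits, target):
    """Numerically stable softmax cross-entropy for a single example."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_z - logits[target]


def balanced_softmax_loss(logits, target, prior, tau=1.0):
    """Training loss: shift logits by tau * log-prior, then plain CE."""
    adjusted = [z + tau * p for z, p in zip(logits, prior)]
    return cross_entropy(adjusted, target)


def predict(logits):
    """Inference: plain argmax over the unadjusted logits."""
    return max(range(len(logits)), key=logits.__getitem__)


counts = [900, 90, 10]          # made-up head / mid / tail class counts
prior = log_prior(counts)
logits = [0.0, 0.0, 0.0]        # an undecided model
plain = cross_entropy(logits, 2)                      # log(3)
adjusted = balanced_softmax_loss(logits, 2, prior)    # -log(0.01)
```

With equal logits the adjusted loss on the tail class is larger than the plain loss, so training pushes the tail-class logit up harder; since the prior is dropped at inference, that extra margin is what helps rare classes.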

The concrete run

configs_advanced/swin_large_384_concrete.yml — SWIN-L @384 (ImageNet-22k), multi-task + balanced softmax + EMA, medium (cold-start-safe) augmentation, 100 epochs, TTA block ready for inference.
submit_concrete.sh — A100-80G seed-ensemble launcher (distinct run_id/output_dir per seed via --set). Nothing is auto-submitted.
CONCRETE_RUN_README.md — run/feature docs, one-time env setup (pip + W&B), smoke test, warm-start, OOM tuning, TTA inference.
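
For the TTA block, horizontal-flip TTA amounts to averaging model outputs over each crop and its mirrored copy. The sketch below shows that averaging on nested lists with a toy scoring function; it does not reproduce `build_multi_crop_transforms`, and `hflip`, `tta_predict`, and `toy_model` are names invented for this example.

```python
def hflip(image):
    """Mirror an image (list of rows) left to right."""
    return [row[::-1] for row in image]


def tta_predict(model, crops, flip=True):
    """Average model outputs over crops x {orig, flip}."""
    views = list(crops)
    if flip:
        views += [hflip(c) for c in crops]
    outputs = [model(v) for v in views]
    n = len(outputs)
    # Element-wise mean across all views.
    return [sum(col) / n for col in zip(*outputs)]


def toy_model(image):
    """Stand-in scorer: class 0 = left-column mass, class 1 = right-column mass."""
    left = sum(row[0] for row in image)
    right = sum(row[-1] for row in image)
    return [float(left), float(right)]


crop = [[1, 0], [1, 0]]
with_flip = tta_predict(toy_model, [crop], flip=True)
without_flip = tta_predict(toy_model, [crop], flip=False)
```

The toy scorer is deliberately orientation-sensitive, so the flipped view cancels its left/right bias; a real classifier would be far less sensitive, and the averaging would mainly reduce variance.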

Fixes surfaced along the way

Output paths auto-relocate: configs hardcode .../workspaces/faridkar/herbdl/...; the trainer now rewrites any .../workspaces/<author>/herbdl prefix to the workspace it's actually run from (preserving the run name), so every config writes under the runner's own workspace with no YAML edits. HERBDL_NO_RELOCATE=1 opts out.
TrainingArguments allow-list was silently dropping gradient_checkpointing and load_best_model_at_end — now forwarded, so the 384 config's gradient checkpointing actually takes effect (it would otherwise OOM).
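
The output-path rewrite can be pictured as a prefix substitution. The sketch below is one way to do it with a regular expression, assuming the prefix always ends in `/workspaces/<author>/herbdl`; the function name, the pattern, and the `/proj/...` paths are hypothetical and not taken from the trainer.

```python
import os
import re

# Hypothetical pattern: everything up to and including
# ".../workspaces/<author>/herbdl", followed by "/" or end of string.
_PREFIX = re.compile(r"^.*/workspaces/[^/]+/herbdl(?=/|$)")


def relocate_sketch(path, repo_root, env=None):
    """Rewrite another author's workspace prefix to repo_root."""
    env = os.environ if env is None else env
    if env.get("HERBDL_NO_RELOCATE") == "1":
        return path  # opt-out: leave the configured path alone
    root = repo_root.rstrip("/")
    # Paths that do not match the pattern are returned unchanged.
    return _PREFIX.sub(lambda _m: root, path, count=1)


# Illustrative paths only; the real workspace root is not shown in this PR.
configured = "/proj/workspaces/faridkar/herbdl/finetuning/output/run1"
mine = "/proj/workspaces/tgardos/herbdl"
relocated = relocate_sketch(configured, mine, env={})
```

Passing `env` explicitly keeps the function easy to unit-test, which matches the kind of checks listed under Validation (rewrite, untouched local paths, opt-out).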

Deferred (future ensemble members, intentionally out of scope)

Domain-pretrained backbone swap (Tier 1.1 timm/open_clip), warmed-up ArcFace rescue (Tier 2.4), class-balanced sampler / two-stage cRT (Tier 1.3-B), and +2021 data (Tier 2.8).

Validation

py_compile + full module import under spring-2026-pyt; all configs_advanced/*.yml parse; --set override flow and output-path relocation unit-tested (faridkar→tgardos rewrite, local/unrelated paths untouched, opt-out honored). No GPU training run yet — recommend the README smoke test before launching the 48h jobs.

🤖 Generated with Claude Code

Snapshot of faridkar's current finetuning/SWIN before applying the section 10/11 (Tier 1 & 2) changes. Brings tgardos's copy up to date:
- SWIN_finetuning_advanced.py: integrated ArcFace + non-backbone overlay, multi-task, mixup/cutmix, multi-crop TTA (now identical to faridkar's)
- configs_advanced/: curriculum chain, seed ensemble, augmented variants
- hyperparameter_configs/: 12-cell LR sweep
- launch_sweep.py, submit_pretrained_seeds.sh, train_advanced.sh (SET_ARGS)
- CURRICULUM_REPORT.md and other reference docs

No tgardos-authored modifications yet; this is faridkar's code verbatim so the subsequent section-11 work is a clean, reviewable diff.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@384

… run

Implements the "concrete next run" from SWIN_training_setup_summary.md (§10 recommendations / §11 step-by-step), scoped to stay within SWIN-L 384.

Trainer (SWIN_finetuning_advanced.py), all config-gated and default-off so existing configs are unchanged:
- Balanced softmax / logit adjustment (Tier 1.3-A): new `long_tail` section; per-class log-prior added to species logits during training only (off at inference), in single-task and multi-task, mixup and non-mixup paths.
- Weight EMA (Tier 2.6): new `ema` section + EMACallback; EMA weights copied into the model at train end so final eval/save reflect them.
- Horizontal-flip TTA (Tier 2.7): build_multi_crop_transforms(flip=...) emits flipped crop variants; gated by `multi_crop.flip`.
- gradient_checkpointing_enable/disable passthrough on MultiTaskSwinModel and SwinWithArcFace so the flag works for the wrapped models (fits SWIN-L @384).

Experiment:
- configs_advanced/swin_large_384_concrete.yml: SWIN-L 384 in22k, multi-task + balanced softmax + EMA, medium (cold-safe) aug, 100ep, TTA block.
- submit_concrete.sh: A100-80G seed-ensemble launcher (no auto-submit).
- CONCRETE_RUN_README.md: run/feature docs, smoke test, warm-start, TTA.

Deferred per chosen scope: backbone swap (1.1), ArcFace rescue (2.4), class-balanced sampler/cRT (1.3-B), +2021 data (2.8).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

- train_advanced.sh: activate spring-2026-pyt (academic-ml/spring-2026) instead of faridkar's personal herb_env, which tgardos does not have.
- submit_concrete.sh: fix env comment to match.
- CONCRETE_RUN_README.md: add one-time environment setup section: `pip install --user evaluate wandb` (the two deps missing from spring-2026-pyt), the pydantic_core user-site shadowing fix, the evaluate metric-cache note, and the gpu_c>=7.0 requirement.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Document team membership requirement, wandb login --relogin (stale-key fix), verification, and offline/skip options. No code change: the trainer already inits W&B from the config.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…est-model

- _relocate_output_dir(): rewrite any .../workspaces/<author>/herbdl prefix in output_dir/logging_dir to this checkout's repo root, preserving the run name, so every config writes under the runner's own workspace instead of faridkar's hardcoded paths. No YAML edits needed; logs the rewrite; HERBDL_NO_RELOCATE=1 opts out.
- Forward gradient_checkpointing, gradient_checkpointing_kwargs, and load_best_model_at_end from config to TrainingArguments (the allow-list was dropping them, so the concrete 384 config's gradient_checkpointing was a no-op and would likely OOM).
- README: document output-path relocation.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ault

Previously a bare `qsub train_advanced.sh` silently fell back to hyperparameter_configs/swin_base_cosine_lr1e4_warmup.yml (a SWIN-base 224 sweep config), a footgun easily mistaken for the intended run. Now error out if CONFIG_FILE is unset or points at a missing file. submit_concrete.sh always sets it explicitly, so it is unaffected.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Farid-Karimli · 2026-06-05T01:30:47Z

I like the updates -- curious to see what effect balanced softmax has.

TrainingArguments read config['training']['eval_steps'] with a hard index, so any config using eval_strategy: "epoch" (e.g. swin_large_384_concrete.yml) crashed before training. Use .get(..., None); eval_steps is only consulted when eval_strategy == "steps".

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
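
The difference between the hard index and `.get` can be sketched as follows. This builds a plain kwargs dict rather than calling `TrainingArguments`, and `FORWARDED_KEYS` is a hypothetical stand-in for the trainer's allow-list, whose real contents are not shown here.

```python
# Hypothetical allow-list; the trainer's actual list is not reproduced here.
FORWARDED_KEYS = (
    "eval_strategy",
    "gradient_checkpointing",
    "gradient_checkpointing_kwargs",
    "load_best_model_at_end",
)


def training_kwargs(config):
    """Collect the keyword arguments to forward from config['training']."""
    training = config.get("training", {})
    kwargs = {k: training[k] for k in FORWARDED_KEYS if k in training}
    # training["eval_steps"] would raise KeyError for epoch-based configs;
    # .get returns None, which is fine because eval_steps only matters
    # when eval_strategy == "steps".
    kwargs["eval_steps"] = training.get("eval_steps", None)
    return kwargs


epoch_cfg = {"training": {"eval_strategy": "epoch", "gradient_checkpointing": True}}
steps_cfg = {"training": {"eval_strategy": "steps", "eval_steps": 500}}
epoch_kwargs = training_kwargs(epoch_cfg)
steps_kwargs = training_kwargs(steps_cfg)
```

Keys outside the allow-list are dropped without warning in this sketch, which is the same failure mode that hid `gradient_checkpointing` before it was added to the list.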

…te note

Document how to babysit a run from iPhone/MacBook via Claude Code Remote Control (session runs on the SCC login node with qsub access, unlike the cloud-sandbox web product), including the SCC-specific update path for the npx --prefix ~/claude-code install (claude update targets the read-only shared module prefix and silently no-ops; use npm install --prefix instead).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

trgardos and others added 5 commits June 3, 2026 22:39

trgardos requested a review from Farid-Karimli June 4, 2026 16:41

trgardos marked this pull request as draft June 4, 2026 16:41

trgardos and others added 3 commits June 5, 2026 15:45

missing -v arg prefix

056e2a5

trgardos wants to merge 9 commits into finetuning from finetuning-trg
