SWIN finetuning: domain-gap closing Tier 1/2 (balanced softmax, EMA, flip TTA) + SWIN-L 384 concrete run#36

Draft
trgardos wants to merge 9 commits into finetuning from finetuning-trg

@trgardos
Contributor

@trgardos trgardos commented Jun 4, 2026

Summary

Implements the Tier 1 & Tier 2 "concrete next run" from SWIN_training_setup_summary.md (§10 recommendations / §11 step-by-step) on top of faridkar's SWIN finetuning framework, scoped to stay within SWIN-L 384 (no backbone swap). Goal: close the macro-F1 gap on the ~15.5k-class long tail.

The first 5 commits are structured so the framework sync is isolated from the new work:

  1. 34045dd Sync framework from faridkar — brings this repo's finetuning/SWIN/ up to faridkar's current trainer (integrated ArcFace + non-backbone overlay, multi-task, mixup/cutmix, multi-crop TTA), plus configs_advanced/ (curriculum + seed ensemble), hyperparameter_configs/, launch_sweep.py, and reference docs. No tgardos edits in this commit — verbatim, so the rest of the PR is a clean diff.
  2. c20acfa §10/§11 Tier 1–2 features + concrete config (see below).
  3. 956f6b5 Environment — use the spring-2026-pyt conda env (not faridkar's personal herb_env); document pip install --user evaluate wandb.
  4. e1f946c W&B setup docs — gardoslab/herbdl logging.
  5. 6bd0636 Output-path auto-relocation + TrainingArguments fix.

New trainer features (all config-gated, default-off — existing configs unchanged)

| Feature | Section | Config key |
| --- | --- | --- |
| Balanced softmax / logit adjustment (species log-prior added to logits at train time only) | 1.3-A | long_tail.logit_adjustment / tau |
| Weight EMA (shadow avg copied into model at train end → final eval/save use EMA) | 2.6 | ema.enabled / decay |
| Horizontal-flip TTA (avg over crops × {orig, flip}) | 2.7 | multi_crop.flip |
| gradient_checkpointing_enable/disable passthrough on the multi-task/ArcFace wrappers | | training.gradient_checkpointing |
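
As a rough illustration of the weight-EMA row above, here is a minimal, framework-free sketch of a shadow average that is copied back into the live weights when training ends. The `WeightEMA` class, the `decay=0.9` value, and the scalar "parameter" are assumptions made for illustration; the PR's `EMACallback` works on model tensors and may be structured differently.

```python
class WeightEMA:
    """Shadow average of a name -> float parameter dict (hypothetical helper)."""

    def __init__(self, params, decay=0.999):
        self.decay = decay
        self.shadow = dict(params)  # independent copy of the starting weights

    def update(self, params):
        # shadow <- decay * shadow + (1 - decay) * current, once per optimizer step
        d = self.decay
        for name, value in params.items():
            self.shadow[name] = d * self.shadow[name] + (1.0 - d) * value

    def copy_to(self, params):
        # Called once at train end so final eval/save see the averaged weights.
        params.update(self.shadow)


params = {"w": 0.0}
ema = WeightEMA(params, decay=0.9)
for _ in range(3):
    params["w"] += 1.0  # stand-in for an optimizer step
    ema.update(params)
ema.copy_to(params)  # live weights are now the EMA weights
```

Because the copy happens only at the end, intermediate evaluations in this sketch would still see the raw weights; whether the PR also swaps weights for mid-training evals is not stated in the description.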

Balanced softmax is applied in MixupTrainer.compute_loss for single-task and multi-task, in both the mixup and non-mixup paths, and is off at inference (plain argmax) — the standard balanced-softmax recipe for lifting macro-F1 on rare classes.
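
A minimal sketch of that recipe, using only the standard library: the per-class log-prior is added to the logits inside the training loss, while prediction uses the raw logits. The function names are hypothetical, the class counts are made up, and the use of `tau` as a multiplier on the log-prior is an assumption about how the `tau` config key behaves; every class is assumed to have at least one training sample.

```python
import math


def log_prior(class_counts):
    """Log of the empirical class frequencies (all counts assumed > 0)."""
    total = sum(class_counts)
    return [math.log(c / total) for c in class_counts]


def balanced_softmax_loss(logits, target, prior, tau=1.0):
    """Cross-entropy on logits shifted by tau * log-prior (train time only)."""
    adjusted = [z + tau * p for z, p in zip(logits, prior)]
    m = max(adjusted)  # subtract the max for a stable log-sum-exp
    log_z = m + math.log(sum(math.exp(a - m) for a in adjusted))
    return log_z - adjusted[target]


def predict(logits):
    """Inference: plain argmax over the unadjusted logits."""
    return max(range(len(logits)), key=logits.__getitem__)


counts = [900, 90, 10]       # made-up head / mid / tail class counts
prior = log_prior(counts)
logits = [2.0, 2.0, 2.0]     # model is indifferent between classes
loss_head = balanced_softmax_loss(logits, 0, prior)
loss_tail = balanced_softmax_loss(logits, 2, prior)
```

With equal logits the adjusted softmax reduces to the class prior, so the loss on the tail class is much larger than on the head class. That pushes the model to raise rare-class logits during training, and the shift is then dropped at inference.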

The concrete run

  • configs_advanced/swin_large_384_concrete.yml — SWIN-L @384 (ImageNet-22k), multi-task + balanced softmax + EMA, medium (cold-start-safe) augmentation, 100 epochs, TTA block ready for inference.
  • submit_concrete.sh — A100-80G seed-ensemble launcher (distinct run_id/output_dir per seed via --set). Nothing is auto-submitted.
  • CONCRETE_RUN_README.md — run/feature docs, one-time env setup (pip + W&B), smoke test, warm-start, OOM tuning, TTA inference.
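
The flip TTA that the config's TTA block enables (averaging over crops × {orig, flip}) can be sketched without any framework as follows. Images are plain lists of rows, `toy_model` is a stand-in classifier, and averaging raw logits rather than probabilities is an assumption; none of these names come from the PR.

```python
def hflip(image):
    """Mirror an H x W image (list of rows) left to right."""
    return [row[::-1] for row in image]


def tta_logits(model, crops, flip=True):
    """Average model outputs over all crops and, optionally, their mirrors."""
    views = list(crops)
    if flip:
        views += [hflip(c) for c in crops]
    outputs = [model(v) for v in views]
    n = len(outputs)
    return [sum(col) / n for col in zip(*outputs)]


def toy_model(image):
    """Stand-in classifier: two 'logits' from the left and right columns."""
    left = sum(row[0] for row in image)
    right = sum(row[-1] for row in image)
    return [float(left), float(right)]


crops = [[[1, 0], [1, 0]], [[0, 3], [0, 3]]]
avg_with_flip = tta_logits(toy_model, crops, flip=True)
avg_no_flip = tta_logits(toy_model, crops, flip=False)
```

In this toy case the model is sensitive to left/right placement, so the flipped views change the averaged output; with `flip=False` the function falls back to plain multi-crop averaging.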

Fixes surfaced along the way

  • Output paths auto-relocate: configs hardcode .../workspaces/faridkar/herbdl/...; the trainer now rewrites any .../workspaces/<author>/herbdl prefix to the workspace it's actually run from (preserving the run name), so every config writes under the runner's own workspace with no YAML edits. HERBDL_NO_RELOCATE=1 opts out.
  • TrainingArguments allow-list was silently dropping gradient_checkpointing and load_best_model_at_end — now forwarded, so the 384 config's gradient checkpointing actually takes effect (it would otherwise OOM).
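
The output-path relocation described in the first bullet could look roughly like the sketch below. This is not the PR's `_relocate_output_dir`; the regex, the function signature, and the `/demo/...` paths are assumptions chosen to match the behaviour described (prefix rewrite, run name preserved, unrelated paths untouched, `HERBDL_NO_RELOCATE=1` opt-out).

```python
import os
import re

# Anything up to and including ".../workspaces/<author>/herbdl" (assumed layout).
_WORKSPACE_PREFIX = re.compile(r"^.*/workspaces/[^/]+/herbdl(?=/|$)")


def relocate_output_dir(path, repo_root, environ=os.environ):
    """Rewrite another author's workspace prefix to the current repo root."""
    if environ.get("HERBDL_NO_RELOCATE") == "1":
        return path  # opt-out: keep the path exactly as configured
    root = repo_root.rstrip("/")
    # A callable replacement avoids backslash escapes in the root being interpreted.
    return _WORKSPACE_PREFIX.sub(lambda _match: root, path, count=1)


# Hypothetical paths, for illustration only.
configured = "/demo/workspaces/faridkar/herbdl/finetuning/output/run1"
my_root = "/demo/workspaces/tgardos/herbdl"
relocated = relocate_output_dir(configured, my_root, environ={})
```

Paths that do not contain the workspace pattern, such as a relative `./output/run1`, pass through unchanged because the regex simply does not match.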

Deferred (future ensemble members, intentionally out of scope)

Domain-pretrained backbone swap (Tier 1.1 timm/open_clip), warmed-up ArcFace rescue (Tier 2.4), class-balanced sampler / two-stage cRT (Tier 1.3-B), and +2021 data (Tier 2.8).

Validation

py_compile + full module import under spring-2026-pyt; all configs_advanced/*.yml parse; --set override flow and output-path relocation unit-tested (faridkar→tgardos rewrite, local/unrelated paths untouched, opt-out honored). No GPU training run yet — recommend the README smoke test before launching the 48h jobs.

🤖 Generated with Claude Code

trgardos and others added 5 commits June 3, 2026 22:39
Snapshot of faridkar's current finetuning/SWIN before applying the
section 10/11 (Tier 1 & 2) changes. Brings tgardos's copy up to date:

- SWIN_finetuning_advanced.py: integrated ArcFace + non-backbone overlay,
  multi-task, mixup/cutmix, multi-crop TTA (now identical to faridkar's)
- configs_advanced/: curriculum chain, seed ensemble, augmented variants
- hyperparameter_configs/: 12-cell LR sweep
- launch_sweep.py, submit_pretrained_seeds.sh, train_advanced.sh (SET_ARGS)
- CURRICULUM_REPORT.md and other reference docs

No tgardos-authored modifications yet; this is faridkar's code verbatim
so the subsequent section-11 work is a clean, reviewable diff.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… run

Implements the "concrete next run" from SWIN_training_setup_summary.md
(§10 recommendations / §11 step-by-step), scoped to stay within SWIN-L 384.

Trainer (SWIN_finetuning_advanced.py), all config-gated and default-off so
existing configs are unchanged:
- Balanced softmax / logit adjustment (Tier 1.3-A): new `long_tail` section;
  per-class log-prior added to species logits during training only (off at
  inference), in single-task and multi-task, mixup and non-mixup paths.
- Weight EMA (Tier 2.6): new `ema` section + EMACallback; EMA weights copied
  into the model at train end so final eval/save reflect them.
- Horizontal-flip TTA (Tier 2.7): build_multi_crop_transforms(flip=...) emits
  flipped crop variants; gated by `multi_crop.flip`.
- gradient_checkpointing_enable/disable passthrough on MultiTaskSwinModel and
  SwinWithArcFace so the flag works for the wrapped models (fits SWIN-L @384).

Experiment:
- configs_advanced/swin_large_384_concrete.yml — SWIN-L 384 in22k, multi-task
  + balanced softmax + EMA, medium (cold-safe) aug, 100ep, TTA block.
- submit_concrete.sh — A100-80G seed-ensemble launcher (no auto-submit).
- CONCRETE_RUN_README.md — run/feature docs, smoke test, warm-start, TTA.

Deferred per chosen scope: backbone swap (1.1), ArcFace rescue (2.4),
class-balanced sampler/cRT (1.3-B), +2021 data (2.8).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- train_advanced.sh: activate spring-2026-pyt (academic-ml/spring-2026)
  instead of faridkar's personal herb_env, which tgardos does not have.
- submit_concrete.sh: fix env comment to match.
- CONCRETE_RUN_README.md: add one-time environment setup section —
  `pip install --user evaluate wandb` (the two deps missing from
  spring-2026-pyt), the pydantic_core user-site shadowing fix, the
  evaluate metric-cache note, and the gpu_c>=7.0 requirement.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Document team membership requirement, wandb login --relogin (stale-key
fix), verification, and offline/skip options. No code change — the
trainer already inits W&B from the config.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…est-model

- _relocate_output_dir(): rewrite any .../workspaces/<author>/herbdl prefix in
  output_dir/logging_dir to this checkout's repo root, preserving the run name,
  so every config writes under the runner's own workspace instead of faridkar's
  hardcoded paths. No YAML edits needed; logs the rewrite; HERBDL_NO_RELOCATE=1
  opts out.
- Forward gradient_checkpointing, gradient_checkpointing_kwargs, and
  load_best_model_at_end from config to TrainingArguments (the allow-list was
  dropping them, so the concrete 384 config's gradient_checkpointing was a no-op
  and would likely OOM).
- README: document output-path relocation.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@trgardos trgardos requested a review from Farid-Karimli June 4, 2026 16:41
@trgardos trgardos marked this pull request as draft June 4, 2026 16:41
…ault

Previously a bare `qsub train_advanced.sh` silently fell back to
hyperparameter_configs/swin_base_cosine_lr1e4_warmup.yml (a SWIN-base 224
sweep config) — a footgun easily mistaken for the intended run. Now error
out if CONFIG_FILE is unset or points at a missing file. submit_concrete.sh
always sets it explicitly, so it is unaffected.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@Farid-Karimli
Collaborator

I like the updates -- curious to see what effect balanced softmax has.

trgardos and others added 3 commits June 5, 2026 15:45
TrainingArguments read config['training']['eval_steps'] with a hard index,
so any config using eval_strategy: "epoch" (e.g. swin_large_384_concrete.yml)
crashed before training. Use .get(..., None); eval_steps is only consulted
when eval_strategy == "steps".

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
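
The `eval_steps` fix in the commit above comes down to replacing a hard dictionary index with a tolerant lookup. A minimal sketch, assuming the config is a nested dict loaded from YAML; the helper name and the example values are hypothetical:

```python
def eval_steps_from_config(config):
    """Return eval_steps if present, else None (safe for epoch-based configs)."""
    training = config["training"]
    # training["eval_steps"] would raise KeyError when the key is absent.
    return training.get("eval_steps", None)


epoch_cfg = {"training": {"eval_strategy": "epoch"}}
steps_cfg = {"training": {"eval_strategy": "steps", "eval_steps": 500}}

epoch_value = eval_steps_from_config(epoch_cfg)
steps_value = eval_steps_from_config(steps_cfg)
```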
…te note

Document how to babysit a run from iPhone/MacBook via Claude Code Remote
Control (session runs on the SCC login node with qsub access, unlike the
cloud-sandbox web product), including the SCC-specific update path for the
npx --prefix ~/claude-code install (claude update targets the read-only
shared module prefix and silently no-ops; use npm install --prefix instead).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>