SWIN finetuning: domain-gap closing Tier 1/2 (balanced softmax, EMA, flip TTA) + SWIN-L 384 concrete run #36
Draft
trgardos wants to merge 9 commits into
Snapshot of faridkar's current finetuning/SWIN before applying the section 10/11 (Tier 1 & 2) changes. Brings tgardos's copy up to date:

- SWIN_finetuning_advanced.py: integrated ArcFace + non-backbone overlay, multi-task, mixup/cutmix, multi-crop TTA (now identical to faridkar's)
- configs_advanced/: curriculum chain, seed ensemble, augmented variants
- hyperparameter_configs/: 12-cell LR sweep
- launch_sweep.py, submit_pretrained_seeds.sh, train_advanced.sh (SET_ARGS)
- CURRICULUM_REPORT.md and other reference docs

No tgardos-authored modifications yet; this is faridkar's code verbatim so the subsequent section-11 work is a clean, reviewable diff.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… run

Implements the "concrete next run" from SWIN_training_setup_summary.md (§10 recommendations / §11 step-by-step), scoped to stay within SWIN-L 384.

Trainer (SWIN_finetuning_advanced.py), all config-gated and default-off so existing configs are unchanged:

- Balanced softmax / logit adjustment (Tier 1.3-A): new `long_tail` section; per-class log-prior added to species logits during training only (off at inference), in single-task and multi-task, mixup and non-mixup paths.
- Weight EMA (Tier 2.6): new `ema` section + EMACallback; EMA weights copied into the model at train end so final eval/save reflect them.
- Horizontal-flip TTA (Tier 2.7): build_multi_crop_transforms(flip=...) emits flipped crop variants; gated by `multi_crop.flip`.
- gradient_checkpointing_enable/disable passthrough on MultiTaskSwinModel and SwinWithArcFace so the flag works for the wrapped models (fits SWIN-L @384).

Experiment:

- configs_advanced/swin_large_384_concrete.yml: SWIN-L 384 in22k, multi-task + balanced softmax + EMA, medium (cold-safe) aug, 100ep, TTA block.
- submit_concrete.sh: A100-80G seed-ensemble launcher (no auto-submit).
- CONCRETE_RUN_README.md: run/feature docs, smoke test, warm-start, TTA.

Deferred per chosen scope: backbone swap (1.1), ArcFace rescue (2.4), class-balanced sampler/cRT (1.3-B), +2021 data (2.8).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
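The logit adjustment described in this commit message can be illustrated with a minimal, framework-free sketch. It is not the code in `SWIN_finetuning_advanced.py`: the helper names, the class counts, and the reading of `tau` as a scale on the log-prior are assumptions made for illustration. It shows only the stated idea, a per-class log-prior added to the logits during training and left out at inference.

```python
import math
from typing import Sequence


def log_prior(class_counts: Sequence[int]) -> list[float]:
    """Per-class log-prior log(n_c / N). Assumes every count is > 0."""
    total = sum(class_counts)
    return [math.log(n / total) for n in class_counts]


def adjust_logits(logits, prior, tau=1.0, training=True):
    """Add tau * log-prior to the logits in training; pass through otherwise.

    `tau` as a scale on the prior is an assumption of this sketch.
    """
    if not training:
        return list(logits)  # inference: plain logits, plain argmax
    return [z + tau * p for z, p in zip(logits, prior)]


def cross_entropy(logits, target):
    """Numerically stable softmax cross-entropy for one example."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return lse - logits[target]


# Hypothetical 3-class long tail: one head class, two rarer classes.
counts = [900, 90, 10]
prior = log_prior(counts)
raw = [2.0, 1.0, 1.0]  # hypothetical model outputs for one image

train_logits = adjust_logits(raw, prior, tau=1.0, training=True)
eval_logits = adjust_logits(raw, prior, training=False)

# Loss when the true label is the rarest class (index 2).
loss_plain_rare = cross_entropy(raw, 2)
loss_adjusted_rare = cross_entropy(train_logits, 2)
```

In this toy case the adjusted loss on the rare class is larger than the plain loss, so the gradient asks for a bigger margin on rare classes during training, while inference uses the unadjusted logits.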
- train_advanced.sh: activate spring-2026-pyt (academic-ml/spring-2026) instead of faridkar's personal herb_env, which tgardos does not have.
- submit_concrete.sh: fix env comment to match.
- CONCRETE_RUN_README.md: add one-time environment setup section: `pip install --user evaluate wandb` (the two deps missing from spring-2026-pyt), the pydantic_core user-site shadowing fix, the evaluate metric-cache note, and the gpu_c>=7.0 requirement.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Document team membership requirement, wandb login --relogin (stale-key fix), verification, and offline/skip options. No code change; the trainer already inits W&B from the config.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…est-model

- _relocate_output_dir(): rewrite any .../workspaces/<author>/herbdl prefix in output_dir/logging_dir to this checkout's repo root, preserving the run name, so every config writes under the runner's own workspace instead of faridkar's hardcoded paths. No YAML edits needed; logs the rewrite; HERBDL_NO_RELOCATE=1 opts out.
- Forward gradient_checkpointing, gradient_checkpointing_kwargs, and load_best_model_at_end from config to TrainingArguments (the allow-list was dropping them, so the concrete 384 config's gradient_checkpointing was a no-op and would likely OOM).
- README: document output-path relocation.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
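The prefix rewrite described in this commit message might look roughly like the sketch below. It is not the actual `_relocate_output_dir()`: the function name `relocate_path_sketch`, the regular expression, and the example paths are hypothetical. Only the behavior comes from the commit message: swap the `.../workspaces/<author>/herbdl` prefix for the current repo root, keep the rest of the path, leave unrelated paths alone, and honor `HERBDL_NO_RELOCATE=1`.

```python
import os
import re
from typing import Mapping, Optional

# Matches everything up to and including ".../workspaces/<author>/herbdl",
# but only when "herbdl" is a full path component.
_WORKSPACE_PREFIX = re.compile(r"^.*?/workspaces/[^/]+/herbdl(?=/|$)")


def relocate_path_sketch(
    path: str,
    repo_root: str,
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Rewrite a hardcoded workspace prefix to `repo_root` (illustrative only)."""
    env = os.environ if env is None else env
    if env.get("HERBDL_NO_RELOCATE") == "1":
        return path  # opt-out: leave the configured path untouched
    match = _WORKSPACE_PREFIX.match(path)
    if match is None:
        return path  # local or unrelated path: nothing to rewrite
    # Keep everything after the prefix, so the run name is preserved.
    return repo_root.rstrip("/") + path[match.end():]


# Hypothetical paths, for illustration only.
old = "/projectnb/example/workspaces/faridkar/herbdl/finetuning/output/run_a"
root = "/projectnb/example/workspaces/tgardos/herbdl"

relocated = relocate_path_sketch(old, root, env={})
untouched = relocate_path_sketch("./output/run_a", root, env={})
opted_out = relocate_path_sketch(old, root, env={"HERBDL_NO_RELOCATE": "1"})
```

Passing `env` explicitly keeps the sketch testable without touching the process environment; the real helper may read `os.environ` directly.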
…ault

Previously a bare `qsub train_advanced.sh` silently fell back to hyperparameter_configs/swin_base_cosine_lr1e4_warmup.yml (a SWIN-base 224 sweep config), a footgun easily mistaken for the intended run. Now error out if CONFIG_FILE is unset or points at a missing file. submit_concrete.sh always sets it explicitly, so it is unaffected.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Collaborator
I like the updates -- curious to see what effect balanced softmax has.
TrainingArguments read config['training']['eval_steps'] with a hard index, so any config using eval_strategy: "epoch" (e.g. swin_large_384_concrete.yml) crashed before training. Use .get(..., None); eval_steps is only consulted when eval_strategy == "steps".

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
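The difference between the hard index and `.get(..., None)` can be shown with a small sketch. The helper `eval_schedule` and the two config dicts are hypothetical stand-ins for the `training` section of a YAML config, not the trainer's actual code, and the `"no"` default for `eval_strategy` is an assumption of this sketch.

```python
from typing import Any, Optional


def eval_schedule(training_cfg: dict[str, Any]) -> dict[str, Optional[Any]]:
    """Collect evaluation-schedule settings without requiring eval_steps."""
    strategy = training_cfg.get("eval_strategy", "no")  # default is assumed
    # .get() returns None when the key is absent instead of raising KeyError.
    steps = training_cfg.get("eval_steps", None)
    return {"eval_strategy": strategy, "eval_steps": steps}


# Hypothetical "training" sections.
epoch_cfg = {"eval_strategy": "epoch"}  # no eval_steps key at all
steps_cfg = {"eval_strategy": "steps", "eval_steps": 500}

epoch_schedule = eval_schedule(epoch_cfg)
steps_schedule = eval_schedule(steps_cfg)

# The old behavior, reproduced on the same dict: a hard index fails.
try:
    epoch_cfg["eval_steps"]
    hard_index_failed = False
except KeyError:
    hard_index_failed = True
```

With an epoch-based config the sketch yields `eval_steps=None`, which is harmless because the value is only consulted under the `"steps"` strategy.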
…te note

Document how to babysit a run from iPhone/MacBook via Claude Code Remote Control (session runs on the SCC login node with qsub access, unlike the cloud-sandbox web product), including the SCC-specific update path for the npx --prefix ~/claude-code install (claude update targets the read-only shared module prefix and silently no-ops; use npm install --prefix instead).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Summary
Implements the Tier 1 & Tier 2 "concrete next run" from `SWIN_training_setup_summary.md` (§10 recommendations / §11 step-by-step) on top of faridkar's SWIN finetuning framework, scoped to stay within SWIN-L 384 (no backbone swap). Goal: close the macro-F1 gap on the ~15.5k-class long tail.

The branch is 5 commits, structured so the framework sync is isolated from the new work:

- 34045dd Sync framework from faridkar: brings this repo's `finetuning/SWIN/` up to faridkar's current trainer (integrated ArcFace + non-backbone overlay, multi-task, mixup/cutmix, multi-crop TTA), plus `configs_advanced/` (curriculum + seed ensemble), `hyperparameter_configs/`, `launch_sweep.py`, and reference docs. No tgardos edits in this commit; it is verbatim, so the rest of the PR is a clean diff.
- c20acfa §10/§11 Tier 1–2 features + concrete config (see below).
- 956f6b5 Environment: use the `spring-2026-pyt` conda env (not faridkar's personal `herb_env`); document `pip install --user evaluate wandb`.
- e1f946c W&B setup docs: gardoslab/herbdl logging.
- 6bd0636 Output-path auto-relocation + `TrainingArguments` fix.

New trainer features (all config-gated, default-off; existing configs unchanged)
- Balanced softmax / logit adjustment (Tier 1.3-A): `long_tail.logit_adjustment` / `tau`
- Weight EMA (Tier 2.6): `ema.enabled` / `decay`
- Horizontal-flip TTA (Tier 2.7): `multi_crop.flip`
- `gradient_checkpointing_enable/disable` passthrough on the multi-task/ArcFace wrappers: `training.gradient_checkpointing`

Balanced softmax is applied in `MixupTrainer.compute_loss` for single-task and multi-task, in both the mixup and non-mixup paths, and is off at inference (plain argmax). This is the standard balanced-softmax recipe for lifting macro-F1 on rare classes.

The concrete run
- `configs_advanced/swin_large_384_concrete.yml`: SWIN-L @384 (ImageNet-22k), multi-task + balanced softmax + EMA, medium (cold-start-safe) augmentation, 100 epochs, TTA block ready for inference.
- `submit_concrete.sh`: A100-80G seed-ensemble launcher (distinct `run_id`/`output_dir` per seed via `--set`). Nothing is auto-submitted.
- `CONCRETE_RUN_README.md`: run/feature docs, one-time env setup (pip + W&B), smoke test, warm-start, OOM tuning, TTA inference.

Fixes surfaced along the way
- Output paths: configs hardcoded `.../workspaces/faridkar/herbdl/...`; the trainer now rewrites any `.../workspaces/<author>/herbdl` prefix to the workspace it's actually run from (preserving the run name), so every config writes under the runner's own workspace with no YAML edits. `HERBDL_NO_RELOCATE=1` opts out.
- The `TrainingArguments` allow-list was silently dropping `gradient_checkpointing` and `load_best_model_at_end`. Both are now forwarded, so the 384 config's gradient checkpointing actually takes effect (without it the run would likely OOM).

Deferred (future ensemble members, intentionally out of scope)
Domain-pretrained backbone swap (Tier 1.1 timm/open_clip), warmed-up ArcFace rescue (Tier 2.4), class-balanced sampler / two-stage cRT (Tier 1.3-B), and +2021 data (Tier 2.8).
Validation
`py_compile` + full module import under `spring-2026-pyt`; all `configs_advanced/*.yml` parse; `--set` override flow and output-path relocation unit-tested (faridkar→tgardos rewrite, local/unrelated paths untouched, opt-out honored). No GPU training run yet; recommend the README smoke test before launching the 48h jobs.

🤖 Generated with Claude Code